Sorting Surnames - especially prefixed Surnames
Posted: 26 Nov 2022 16:52
Various threads have pondered issues with surname sorting and prefixes:
Surname prefix (SPFX) -- more generally, handling structured names. (20719)
How to sort results by surname when using Work with Data option? (21197)
Rather than repeat all the above, I want to ponder how we think we should actually use FH, including the Given Name, Surname Prefix and Surname fields (GIVN, SPFX and SURN, accessed via the All tab) as well as the more "conventional" (?) Name and Slashed Surname constructions (NAME and NAME:SURNAME).
FH conventionally sorts on "surname" (being the stuff between the slashes - NAME:SURNAME) and then the "given names" (being the stuff outside the slashes - both before and after the slashes - NAME:GIVEN_ALL). How this works for prefixed surnames (Jacqueline du Pré, Vincent Van Gogh, Ursula von der Leyen, Ludwig van Beethoven, Wikus van de Merwe, Ronan O'Gara, etc.) is more problematic; rules swiftly become complex and there always seem to be exceptions.
The issue I am trying to get my mind around is what accountants call "substance over form".
Aside: [In the accounting world there were debates about "off balance sheet financing" - whereby companies could keep their debt levels down (and credit rating up). Instead of borrowing to invest they would lease assets and the lease payments went through the profit and loss account as an expense. Technically the "form" of the transactions was being acknowledged, but the "substance" was that both methods incurred similar forward levels of obligation to make future payments (leasing charges or debt interest and repayments) and these needed to be "properly and consistently" reflected in accounts. Similar thoughts applied to Final Salary Pension schemes. Showing the "substance" of these obligations so frightened companies that many closed their "defined benefit" pension schemes.]
In terms of names, all the above examples can follow "correct form" and can be "chopped" into "Surname Prefix" (du, van, von, von der, van de, O' etc.), which goes in the SPFX field, and "Surname", which goes in the SURN field. Mike Tate has written a plug-in that will do this chopping.
But the effect of doing that does not reflect the substance, the realities. We can cite (and argue/debate) various examples but purely as an illustration, I offer (in terms of "name", "sorted as"):
- Ludwig van Beethoven > Beethoven, Ludwig van
- Ursula von der Leyen > von der Leyen, Ursula
- Jacqueline du Pré > du Pré, Jacqueline
- Simone de Beauvoir > Beauvoir, Simone de
- Ulrich Le Pen > Le Pen, Ulrich
- Vincent Van Gogh > Van Gogh, Vincent
- Ronan O'Gara > O'Gara, Ronan
The question then is how to reflect the "substance" of names, so that they sort correctly, form correct sentences where we are referring to people by their "surnames" ("... the Beethovens lived in Bonn ..."), and capitalise correctly when required ("von der LEYEN"?).
Elsewhere I have pondered whether we should recognise "John /Smith/", "Jacqueline /du Pré/" as being "secondary sort/primary sort/" (substance) rather than "given/surname/" (form).
For NAME we might enter:
- Ludwig van Beethoven (Beethoven, Ludwig van) as Ludwig van /Beethoven/
- Ursula von der Leyen (von der Leyen, Ursula) as Ursula /von der Leyen/
- Jacqueline du Pré (du Pré, Jacqueline) as Jacqueline /du Pré/
- Simone de Beauvoir (Beauvoir, Simone de) as Simone de /Beauvoir/
- Ulrich Le Pen (Le Pen, Ulrich) as Ulrich /Le Pen/
- Vincent Van Gogh (Van Gogh, Vincent) as Vincent /Van Gogh/
- Ronan O'Gara (O'Gara, Ronan) as Ronan /O'Gara/
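To illustrate the "secondary sort/primary sort/" idea, here is a rough sketch in Python (not FH's plug-in language, and not how FH itself is implemented). It treats whatever sits between the slashes as the primary sort and everything outside the slashes as the secondary sort. The function names `split_name`, `sorted_as` and `sort_key` are made up for this example, and the case-insensitive comparison is my assumption rather than FH's documented collation.

```python
import re


def split_name(name):
    """Split a slashed NAME into (before_slashes, surname, after_slashes)."""
    match = re.match(r"^(.*?)/(.*?)/(.*)$", name)
    if not match:
        # No slashes at all: treat the whole string as given names.
        return name.strip(), "", ""
    before, surname, after = (part.strip() for part in match.groups())
    return before, surname, after


def given_all(name):
    """Everything outside the slashes, before and after, joined by a space."""
    before, _, after = split_name(name)
    return " ".join(part for part in (before, after) if part)


def sorted_as(name):
    """Render a NAME in 'primary sort, secondary sort' display form."""
    surname = split_name(name)[1]
    given = given_all(name)
    if not surname:
        return given
    return f"{surname}, {given}" if given else surname


def sort_key(name):
    """Primary sort on the slashed part, secondary sort on the rest."""
    surname = split_name(name)[1]
    return (surname.casefold(), given_all(name).casefold())


names = [
    "Vincent /Van Gogh/",
    "Simone de /Beauvoir/",
    "Jacqueline /du Pré/",
    "Ronan /O'Gara/",
]
for name in sorted(names, key=sort_key):
    print(sorted_as(name))
```

The point of the sketch is that no prefix rules are needed at sort time: the person entering the name decides what is "inside" and that decision alone drives the sort.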
This situation is further complicated if we try to bring the GEDCOM GIVN, SPFX and SURN fields into play. I am trying to formulate a wish list item to do this. The interface element is relatively easy to specify; the background processing requirements could get very complex - particularly if we try to respect substance over form.
What then happens when using GIVN, SPFX and SURN in preference to NAME and do we need to maintain alignment? FH works off NAME so ideally we need to represent GIVN, SPFX and SURN as NAME - by some form of concatenation and correct positioning of slashes in order to ensure that things like the Records Window "work".
Trying to recognise "substance" over "form" might we have [GIVN SPFX SURN]:
- Ludwig van Beethoven (Beethoven, Ludwig van) as Ludwig van Beethoven
- Ursula von der Leyen (von der Leyen, Ursula) as Ursula von der Leyen
- Jacqueline du Pré (du Pré, Jacqueline) as Jacqueline du Pré
- Simone de Beauvoir (Beauvoir, Simone de) as Simone de Beauvoir
- Ulrich Le Pen (Le Pen, Ulrich) as Ulrich Le Pen
- Vincent Van Gogh (Van Gogh, Vincent) as Vincent Van Gogh
- Ronan O'Gara (O'Gara, Ronan) as Ronan O'Gara
Aligning NAME to match input GIVN, SPFX and SURN for the above examples is relatively easy:
- NAME = GIVN + SPFX + "/" + SURN + "/" (with appropriate spacing)
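As a minimal sketch of that concatenation (again in Python purely for illustration; `build_name` is a hypothetical helper, not an FH function), the "appropriate spacing" amounts to joining the non-empty pieces outside the slashes with single spaces. It assumes that a non-sorting prefix has been put in SPFX and that any prefix which should take part in the sort has been entered as part of SURN.

```python
def build_name(givn, spfx, surn):
    """Build a slashed NAME as GIVN + SPFX + "/" + SURN + "/".

    Empty or missing pieces are skipped so that no stray spaces appear.
    """
    outside = " ".join(
        piece.strip() for piece in (givn, spfx) if piece and piece.strip()
    )
    inside = (surn or "").strip()
    return f"{outside} /{inside}/".strip()


# Non-sorting prefix held in SPFX, so it lands outside the slashes.
print(build_name("Simone", "de", "Beauvoir"))
# Sorting prefix entered as part of SURN, so it lands inside the slashes.
print(build_name("Jacqueline", "", "du Pré"))
```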
Trying to programmatically (say as a result of a wish list item) align GIVN SPFX and SURN with an input "Given/Surname/" NAME construction is more difficult because we have no way to recognise when there is a "non sorting" surname prefix outside the slashes - that is a consequence of allowing substance to trump form!
There are rules: some say that in French it matters whether the prefix is an article (e.g. Le, L'), in which case it is part of the sort, or a preposition (e.g. de, du), in which case it is not normally part of the sort; some say that in Afrikaans the prefix is normally always part of the sort. It's the "normally" that comes back to bite us. The more complex the rules, the less likely we are to have "abnormal" surnames which would require manual edits.
The temptation in writing a wish list item is to say:
- here are the requirements for an interface (discussed elsewhere although my thinking has evolved a bit since then.)
- here are the requirements to convert input GIVN, SPFX and SURN fields into a workable NAME field (following substance - primary sort/secondary sort - rather than strict form), and
- converting the other way does not really matter and can be done by manual edit, or by a coarse conversion (e.g. all uncapitalised short words immediately before the opening slash are "surname prefix") followed by manual review
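The coarse conversion in the last bullet could look something like the sketch below. It is only an illustration: `coarse_split` is a hypothetical name, "short" is assumed to mean three characters or fewer, and anything after the closing slash is simply ignored. Every result would still need the manual review mentioned above.

```python
import re

MAX_PREFIX_LEN = 3  # assumption: what counts as a "short" word


def coarse_split(name, max_len=MAX_PREFIX_LEN):
    """Coarsely split a slashed NAME into GIVN, SPFX and SURN.

    Short all-lowercase words immediately before the opening slash are
    treated as the surname prefix; whatever is inside the slashes is
    taken as SURN unchanged.
    """
    match = re.match(r"^(.*?)/(.*?)/", name)
    if not match:
        # No slashed surname: nothing to split, leave it all in GIVN.
        return {"GIVN": name.strip(), "SPFX": "", "SURN": ""}
    words = match.group(1).split()
    prefix = []
    # Walk backwards from the slash, collecting candidate prefix words.
    while words and words[-1].islower() and len(words[-1]) <= max_len:
        prefix.insert(0, words.pop())
    return {
        "GIVN": " ".join(words),
        "SPFX": " ".join(prefix),
        "SURN": match.group(2).strip(),
    }


print(coarse_split("Simone de /Beauvoir/"))
print(coarse_split("Jacqueline /du Pré/"))
```

A given name that happens to be a short lowercase word would be swept into SPFX by this rule, which is exactly the sort of case the manual review is there to catch.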