What is wrong with the FH import from FTM
Posted: 26 Dec 2015 15:40
I have today posted a plugin I have developed to sort the problems with FTM imports. This is on the plugin forum. This post examines the problems which it addresses.
Introduction
There are two main routes to get data from Ancestry into Family Historian (v6) (FH) – either directly from a GEDCOM file exported from the Ancestry website, or via Family Tree Maker (2014) (FTM). The latter route is greatly to be preferred as it brings a complete set of images, not only for photographs uploaded to the site, but also images of all the documents cited in Sources where Ancestry has one.
However there are numerous faults with the import, which fall into two categories:-
• Serious – involving apparent loss of data, for example truncation of text in various places, and generating a worryingly large number of UDFs (Uncategorised Data Fields).
• Stylistic – although data is not lost, operation in FH is less than ideal, for example descriptions being often far too long to display in the single line presented in the user interface (UI).
In spite of these shortcomings, having the media imported makes the FTM route by far the least-worst method in my book. And FH has two great advantages over other programs. As the GEDCOM is the data file for FH you can open it and see exactly what is going on, and with Plugins any problems found can be fixed up. In fact nothing that I can see has actually been lost in the import – it is still in the GEDCOM data in some form. FH just can’t (or won’t?) find it.
Much more extensive loss of data occurs with other software such as RootsMagic, and there is nothing you can do about it (short of decompiling their program).
This document analyses the faults with the FH import with a view to correcting them.
The FindUDF plugin (from the Plugin Store) has been invaluable in highlighting the various issues arising. I am now at the stage where my plugin is completely eliminating them in my own files, though I dare say problems might arise with other people’s different styles of working.
Serious Issues
Truncation of text occurs for more than one reason (1-4 below):-
1. For inline NOTEs any CONC extensions are stored as siblings of the main record rather than as children, and therefore are not displayed:-
1 RESI
2 DATE 2 APR 1911
2 PLAC Brighton, Sussex, England
2 SOUR @S170@
3 PAGE Class: RG14; Piece: 5091; Schedule Number: 152
3 OBJE @O456@
2 NOTE Relation to Head of House: Son. 19 Cavendish Street. Father a lodging
2 CONC house proprietor.
The CONC is understandably not displayed in FH. The original GEDCOM reads:-
1 RESI Relation to Head of House: Son. 19 Cavendish Street. Father a lodging
2 CONC house proprietor.
2 DATE 02 APR 1911
2 PLAC Brighton, Sussex, England
2 SOUR @S170@
3 PAGE Class: RG14; Piece: 5091; Schedule Number: 152
3 OBJE @M456@
FH has “corrected” the illegal inline text for RESI; however the CONC extension was previously at the correct level. (Actually the level number is not changed by import although it should have been.)
Solution: Read both tags directly, concatenate them, delete the CONCs and write the full text back to the NOTE.
NB This problem does not occur in referenced NOTEs.
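As a minimal sketch of that fix, here is the idea in standalone Python working on raw GEDCOM lines. This is not the plugin's own code (which runs inside FH); the function name and the list-of-lines input are assumptions made purely for illustration.

```python
def merge_sibling_concs(lines):
    """Fold CONC lines that sit at the same level as the NOTE before them
    back into that NOTE. `lines` is a list of raw GEDCOM lines."""
    out = []
    note_level = None  # level of the NOTE currently being extended, if any
    for line in lines:
        level, _, rest = line.partition(" ")
        tag, _, value = rest.partition(" ")
        if tag == "CONC" and note_level is not None and level == note_level:
            # Mis-levelled continuation: append it, adding a space if needed
            sep = "" if out[-1].endswith(" ") or value.startswith(" ") else " "
            out[-1] += sep + value
            continue  # the CONC line itself is dropped
        note_level = level if tag == "NOTE" else None
        out.append(line)
    return out
```

Applied to the imported RESI example above, the two lines "2 NOTE ..." and "2 CONC house proprietor." come back as a single NOTE line ending "Father a lodging house proprietor."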
2. SOURce PAGEs from citations (Where in Source) with CONC extensions. The extension is not displayed, although in this case it is stored with the correct level. This is presumably because FH believes that PAGE records have no business having a CONC.
Solution: Although not the same problem as for NOTEs, the same solution works.
3. _ATTR records which FH has made from FTM Custom Events. This one is very strange! In these cases the CONC extensions are of the correct level, but separated from the record to which they refer:-
1 _ATTR Newburgh Estate Valuation of the Township of Oulston 1849 by Mr H
2 TYPE Farm Valuation
2 DATE 1849
2 PLAC Oulston, Coxwold, N Yorks
2 OBJE @O99@
2 OBJE @O205@
2 OBJE @O267@
2 OBJE @O319@
2 CONC Scott and Thomas Bradley
The FTM original in this case reads:-
1 EVEN Newburgh Estate Valuation of the Township of Oulston 1849 by Mr H
2 CONC Scott and Thomas Bradley
2 TYPE Farm Valuation
2 DATE 1849
2 PLAC Oulston, Coxwold, N Yorks
2 OBJE @M99@
2 OBJE @M205@
2 OBJE @M267@
2 OBJE @M319@
Solution: Read the main record. Cycle through all the children, collecting all CONCs at the correct level. If any are found, then delete the CONC records. Add the collected CONCs to the main record and update it.
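The steps in that solution can be sketched roughly as follows, again in standalone Python rather than as an FH plugin. The function name and the assumption that one fact arrives as a list of lines (main record first) are mine, not taken from the plugin.

```python
def gather_separated_concs(record):
    """record: raw GEDCOM lines for one fact, main record first.
    Collect CONC lines one level below the main record, wherever they
    ended up among the children, and append them to the main record."""
    head, children = record[0], record[1:]
    child_level = str(int(head.split(" ", 1)[0]) + 1)
    kept, pieces = [], []
    for line in children:
        parts = line.split(" ", 2)
        if len(parts) > 1 and parts[0] == child_level and parts[1] == "CONC":
            pieces.append(parts[2] if len(parts) > 2 else "")
        else:
            kept.append(line)
    for piece in pieces:
        if not piece:
            continue  # an empty CONC adds nothing
        # FTM breaks on word boundaries, so add a space unless one is present
        sep = "" if head.endswith(" ") or piece.startswith(" ") else " "
        head += sep + piece
    return [head] + kept
```

For the _ATTR example above this gives a first line ending "by Mr H Scott and Thomas Bradley", followed by the TYPE, DATE, PLAC and OBJE lines in their original order.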
4. OCCU Occupation records – CONC text is again separated from the main record. RELI, EDUC and PROP records also behave in this way.
1 OCCU Farmer [his father makes reference to "my son William's farm" in his
2 DATE 1834
2 PLAC Yorkshire
2 CONC Will]
FTM original was:-
1 OCCU Farmer [his father makes reference to "my son William's farm" in his
2 CONC Will]
2 DATE 1834
2 PLAC Yorkshire
Solution: As for 3 above
5. SOURces from FTM no longer display their linked images. A source generally has both referenced content AND inline content. The referenced content is general, referring to the whole of a source document, whereas the inline content is specific to the current citation. FH looks only at the referenced source for media and ignores the media reference which is still present inline in the imported GEDCOM.
“Unsourced Citations” from Ancestry appear as “Source Notes” in FH. These are really no different to normal sources. They appear in the GEDCOM as inline, unreferenced sources. Where they have an image reference this will become unavailable. URL links to the online source are also lost.
Solution: Read all the fields both from the reference and inline part. Create a new referenced source. Note that the PAGE record must remain inline as it is part of the Citation, not the Source record, which GEDCOM defines differently. A single record can be created for all citations of the same SOURce for the same page.
Remove the existing citation. Alternatively all the citation information can be stored in a less general source, along with the image. A third option is simply to store the images against the individual and to keep the FTM source structure intact. After processing all citations with either of the first two options, the original FTM sources can be removed.
All Unsourced Citations are turned into referenced sources to preserve any images and to give a consistent presentation.
Example of inline source with an image – although the image is preserved it is not associated with the source.
1 BURI
2 DATE 24 OCT 1636
2 PLAC St Andrew Undershaft, London, England
2 SOUR @S350@
2 SOUR Details: Londinium Redivivium - Volume 1 - James Peller Malcolm Citation Text: Burial in St Andrew Undershaft
3 OBJE @@M56@@
2 OBJE @O56@
(n.b. the source @S350@ is unrelated)
Where they are found the “Details:” and “Citation Text:” markers inserted by Ancestry can be used to split the citation from the source.
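A rough sketch of that split, in standalone Python. I am assuming here that the text after “Details:” belongs to the source and the text after “Citation Text:” belongs to the citation; the function name is hypothetical.

```python
def split_inline_source(text):
    """Split an Ancestry inline source value on its markers.
    Returns (source_part, citation_part); citation_part is '' if the
    'Citation Text:' marker is absent."""
    details_marker, citation_marker = "Details:", "Citation Text:"
    source, citation = text, ""
    if citation_marker in text:
        source, citation = text.split(citation_marker, 1)
    source = source.strip()
    if source.startswith(details_marker):
        source = source[len(details_marker):]
    return source.strip(), citation.strip()
```

For the BURI example above this separates "Londinium Redivivium - Volume 1 - James Peller Malcolm" from "Burial in St Andrew Undershaft". Text with neither marker is returned unchanged as the source part.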
6. Military Records from FTM are usually in the custom tag _MILT. FH completely ignores them. SOURces and OBJEcts within them are rendered within double @s, and Media items use M rather than O prefixes. The numeric part, however, is actually a valid FH reference number. I have termed these constructions “Buried references”.
(Note that this numeric equivalence does not hold for INDI records as FH moves the root person to be record @I1@. There may also be gaps in the list where deletions have occurred.)
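Converting a buried reference back into a normal pointer might look like the sketch below (standalone Python, not the plugin's code). It deliberately handles only the M and S prefixes, since the numeric equivalence is stated not to hold for INDI records; the function name is an assumption.

```python
import re

# Buried references look like @@M56@@ or @@S170@@
BURIED = re.compile(r"@@([MS])(\d+)@@")

def unbury_references(value):
    """Rewrite buried media/source references as ordinary GEDCOM pointers.
    Media items change prefix from M to O; sources keep S."""
    def convert(match):
        prefix = "O" if match.group(1) == "M" else match.group(1)
        return "@" + prefix + match.group(2) + "@"
    return BURIED.sub(convert, value)
```

Because the pattern requires a prefix letter and digits between the doubled @s, an escaped email address such as arch@@reading.gov.uk is left alone.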
CONCs are separated from the main record in the same way as OCCUpations.
Another form of this record is _MILTID which relates specifically to a serviceperson’s military ID number (as in Name, Rank...). All the above relates to them except that the CONC in this case is correctly placed.
Solution: Recreate them as _ATTR of type Military
Looking at some further examples, the separation of CONCs does not always happen, so it should be treated as a “Might Happen”.
7. Employment Records from FTM are in the _EMPLOY tag.
Solution: Move them to Attributes of type Employment. They can contain buried references, and need to be handled in parallel with _MILT
8. URLs in _LINK records do not show up at all
Solution: Put them into the Note field, where they can be seen and can be cut and pasted into a browser. Perhaps I will do something better in a future version.
9. Images in OBJE Records are all marked as Pictures in the record window, even when they are documents or stories. Also the wrong main image is selected in the focus window.
Solution: There is a Type column in the Media record window, which displays “Picture” against every image. For the life of me I cannot discover where this can be edited or where it is stored. However there is a Keyword stored in _KEYS which also always contains “Picture”. This can be changed to “Story” where the extension is .htm, and to “Document” for all images attached to SOURces. While this is not perfect, it is right most of the time. The Media record window can be configured to show this rather than “Type”.
The default picture is stored in the _PHOTO tag, which can be used to select the correct one by reordering the list. This tag is a buried reference of the form @@M56@@ which can be converted to an FH reference. The _PHOTO tag can then be deleted.
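The keyword rule described above is simple enough to sketch in a few lines of standalone Python. The function name and the `attached_to_source` flag are hypothetical; how one finds out whether a media record is attached to a SOURce is not shown.

```python
import os

def choose_keyword(file_name, attached_to_source):
    """Pick a replacement for the blanket 'Picture' keyword in _KEYS."""
    extension = os.path.splitext(file_name)[1].lower()
    if extension == ".htm":
        return "Story"      # Ancestry stories are saved as .htm files
    if attached_to_source:
        return "Document"   # images cited from a SOURce
    return "Picture"        # everything else keeps the default
```

As noted, this is right most of the time rather than always; a photograph attached to a source would still come out as “Document”.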
10. ADDRess records stored directly in INDIviduals are simply flagged as UDF. If such an address has any CONCs attached to it, then they are separated from the main record.
Solution: Create a new RESI record and put the Address within it, resolving any CONC issues along the way.
11. REPOsitory records – Email addresses are lost. (While a peripheral matter, it is a loss of data, so I have classed it under Serious Issues.) It happens because FH is looking for it in _EMAIL. NB the ‘@’ symbol is doubled – no doubt to escape it from pointer mechanisms, and this is perhaps why FH feels the need to use the custom tag. There are EMAIL tags (no underscore) even where there is no data, brought in from FTM. These are not needed – and are not actually legal GEDCOM. Could it have something to do with the fact that EMAIL is a GEDCOM 5.5.1 beast?
Additionally the Address field in repositories can have CONCs which are lost, as in other places.
Solution: Read the EMAIL tags. Delete them. Where they are not empty create new _EMAIL records. Retain the double @s in the email address. Add the address concatenations to the main Address record. Note that these addresses are often URLs. If any are detected within the original text they are concatenated without a space.
Note: ADDR records are often web addresses, which would perhaps be better moved into _WEB tags. As they are currently visible and working this cannot be classed as a fault, so nothing is done in this plugin.
Examples of actual Repository records – the first has a concatenated address and a blank EMAIL, the second has EMAIL data which is lost. Note the @@.
0 @R12@ REPO
1 NAME Register of Heritage Places, W. Australia
1 ADDR http://register.heritage.wa.gov.au/PDF_ ... ohnsons%20
2 CONC Complex%20(I-AD).PDF
1 EMAIL
0 @R18@ REPO
1 NAME Berkshire Record Office
1 EMAIL arch@@reading.gov.uk
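The EMAIL part of the solution could be sketched like this in standalone Python. It assumes the input holds only the lines of REPO records and that the EMAIL tags sit at level 1, as in the examples above; the function name is made up for illustration.

```python
def fix_repo_emails(lines):
    """Drop level-1 EMAIL lines; recreate non-empty ones as _EMAIL.
    The doubled @@ in the address is kept exactly as found."""
    out = []
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) >= 2 and parts[0] == "1" and parts[1] == "EMAIL":
            address = parts[2].strip() if len(parts) > 2 else ""
            if address:
                out.append("1 _EMAIL " + address)
            continue  # the original EMAIL line is always removed
        out.append(line)
    return out
```

Run over the two records above, the blank EMAIL under @R12@ disappears and the one under @R18@ becomes "1 _EMAIL arch@@reading.gov.uk".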
12. Media Records appear headed with the file path, although there is a TITLe in the data. This is because it is at the wrong level.
0 @O953@ OBJE
1 _FILE Media\Apjohn_Hamersley_Winter_Dixon Media\Australian Electoral Rolls 19031954(9).jpg
2 TITL Australian Electoral Rolls, 1903-1954
This follows the incorrect GEDCOM usage of FTM:
0 @M935@ OBJE
1 FILE C:\Users\Ian\Documents\Family Tree Maker\Apjohn_Hamersley_Winter_Dixon Media\1861 England Census(27).jpg
2 TITL 1861 England Census
Solution: Raise the title to level 1 and it becomes visible in FH.
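A minimal sketch of raising the title, in standalone Python over raw GEDCOM lines (the function name is an assumption, and a real plugin would do this through FH's own data access rather than on text):

```python
def raise_media_titles(lines):
    """Inside each level-0 OBJE record, move any level-2 TITL up to level 1."""
    out = []
    in_obje = False
    for line in lines:
        parts = line.split(" ", 2)
        if parts[0] == "0":
            # A new record starts; note whether it is a media record
            in_obje = len(parts) > 2 and parts[2] == "OBJE"
        elif in_obje and len(parts) > 1 and parts[0] == "2" and parts[1] == "TITL":
            line = "1 " + line[2:]
        out.append(line)
    return out
```

In the @O953@ example the line "2 TITL Australian Electoral Rolls, 1903-1954" becomes "1 TITL Australian Electoral Rolls, 1903-1954", while the _FILE line is untouched.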
Very occasionally an OBJE record turns up in the NAME tag (not within a source). These are always duplicated elsewhere.
Solution: Simply remove them.
Stylistic Problems:
1. HTML special characters, in particular the entities for space, apostrophe and ampersand (such as &nbsp;, &apos; and &amp;), are found in various places; they have not been tidied up by Ancestry, FTM or FH and appear literally in the text. The first is frequently found in PAGE records, and the latter two in text which has been cut and pasted from a web page into Ancestry.
Solution: They are replaced with Space, Apostrophe and Ampersand characters in these cases.
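One way to do this kind of tidy-up outside FH is Python's standard `html.unescape`. Note that this sketch is broader than the replacement described above: it decodes every HTML entity it recognises, not just the three mentioned, and then turns non-breaking spaces into ordinary ones. The function name is hypothetical.

```python
import html

def tidy_html_entities(text):
    """Decode HTML entities left literally in the text, and replace any
    resulting non-breaking space with a normal space."""
    return html.unescape(text).replace("\u00a0", " ")
```

For example, "Smith &amp; Sons" comes back as "Smith & Sons", and text without entities is returned unchanged.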
2. Records too long to display on a single line. These are optionally converted to notes if they are longer than a user-specified length (default 100), leaving a specified number of words (default 6) followed by “ …” in the main record.
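The shortening rule can be sketched as below in standalone Python. The parameter names are my own; only the defaults (100 characters, 6 words) and the “ …” suffix come from the description above.

```python
def shorten_record(text, max_length=100, keep_words=6):
    """Return (short_text, note_text). note_text is None when the record
    already fits and nothing needs to move to a note."""
    if len(text) <= max_length:
        return text, None
    short = " ".join(text.split()[:keep_words]) + " …"
    return short, text  # the full text goes to the note
```

So a record longer than the limit keeps its first six words in the main record, and the complete original text is what would be written to the note.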
3. Publication Info. Ancestry generate a huge field in a Source in the “PUBL” tag. This is normally truncated, even on the Ancestry website, so it cannot be completed. Often this is pure dead weight that users may never want to see. With the strategy of recreating sources in FH-friendly form, we have greatly increased the number of source records (except for Method 3), so this overhead is multiplied many times. Therefore if we do not want to lose the information, it is much better to store it in a shared note, where it is also not blasting out on every screen, but can be found if needed. The plugin offers three options: discard it altogether, move it to NOTEs, or keep it in the re-created sources.
4. Numbering of records. There are often a large number of source and media records with the same name.
Solution: They can optionally be numbered either sequentially or based on the numbering in the media file names. Both source and media are given the same name.
General Note on Concatenation
The FTM GEDCOM in general breaks its data on word boundaries, and does not leave a space on the end of the previous (correct) line or the start of the CONC (incorrect). Therefore a special concatenation routine has been used which inserts a space if one is not already there. This is obviously wrong if a split word does occur, but it is right much more often than it is wrong in the specific case of an import from FTM.
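A sketch of such a joining rule in standalone Python, including the exception for URLs mentioned under the REPOsitory issue. How a URL is detected is my assumption (the presence of "://" in the main text); the function name is also hypothetical.

```python
def join_conc(main, extension):
    """Join a record to its CONC text the way an FTM import needs it."""
    if "://" in main:
        # Looks like a URL: a space here would break the link
        return main + extension
    if main.endswith(" ") or extension.startswith(" "):
        return main + extension  # a space is already present
    return main + " " + extension  # FTM broke on a word boundary
```

The trade-off is as described: a word genuinely split across the break will gain an unwanted space, but for FTM exports that case is the rarer one.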