* Dealing with duplicate image paths and files

JP Ford
Diamond
Posts: 86
Joined: 16 Feb 2020 14:11
Family Historian: V6.2
Location: Yorkshire, UK

Post by JP Ford » 14 Apr 2020 14:09

After searching around the FHUG, this post seems to fit in several forums, so I've decided to post it here. Please feel free to move it if necessary. I'm looking for a regex or Lua pattern that will compare two media file names that use these patterns:

Original pattern : 1234_A123456-1234.jpg
Duplicated pattern: 1234_A123456-1234 (1234_A123456-1234).jpg

I use Ancestry to do focused, intensive research on specific branches of my research lines. My process is usually to create a new tree with a single person who is the beginning of my branch of interest, then spend hours/days/weeks focused on that person's ancestral or descendant line. Since I've used Ancestry for decades now, this is a very efficient and effective process for me.

Once I've accomplished my research goals (or exhausted my efforts), I import the complete tree & associated media into RootsMagic using TreeSync, then export it to a (non-RM) compatible GEDCOM. I then edit the GEDCOM to change the media paths to my FH media folder, move the associated media files into my FH media folder, then import the GEDCOM into FH as-is. This process works beautifully and it is very fast for me. It does, however, have one hiccup: Ancestry tends to copy media files. As a result, I end up with numerous file names that look like this:
[Attached image: (001).png]
I can confirm that these are all exact duplicates. As you can see, the media filenames have a consistent naming pattern of five digits, an underscore, an uppercase letter, six digits, a dash, five digits and the extension. The duplicates have the same filename but with the addition of a space, then the exact filename again in parentheses, followed by the extension.

My solution to this duplication has been to simply edit the media path of the duplicated file, removing the space+parentheses so that the path points to the original file. I can then run the "Find Unlinked Media" plugin to confirm and delete the (now unlinked) duplicate files. (Of course, this is done only after a backup of my gedcom prior to the changes).

This file path editing can be done easily enough in the media listing of the records window sorted by Media Record, opening a Properties window and docking it to the side of the media list. Find the duplicated files, change the pathname in the properties window, move down the list. I would prefer a more automated method, if possible. Obviously a regex/string search & replace comes to mind.

I've tested the Search and Replace plugin and I can easily use Lua pattern mode to find a list of either the original pattern OR the duplicated file path pattern. This will generate a list of one or the other, but not both. Because there is a possibility that a media file with the duplicated filename pattern exists while the original does not, I need to be able to find those files that match the original pattern AND have a matching file with the duplicated pattern. I do not believe the S&R plugin can do this OR, if it can, I do not have the Lua script familiarity to do it.
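
The check described above (a duplicated-pattern name only counts when its original also exists) can be sketched outside FH in a few lines of Python. This is only an illustration, not an FH plugin and not something the Search and Replace plugin does: the function names are hypothetical, the filename shape is taken from the description in this post, and `scan_folder` assumes all media files sit in a single folder.

```python
import re
from pathlib import Path

# Assumed filename shape, taken from the post: digits, underscore, one
# uppercase letter, digits, dash, digits. A duplicate repeats that stem in
# parentheses after a single space, before the extension.
DUPLICATE = re.compile(r"^(\d+_[A-Z]\d+-\d+) \(\1\)(\.\w+)$")


def pair_duplicates(filenames):
    """Return ({duplicate: original}, [duplicates whose original is missing])."""
    present = set(filenames)
    paired, orphaned = {}, []
    for name in sorted(present):
        match = DUPLICATE.match(name)
        if not match:
            continue  # not a duplicated-pattern name
        original = match.group(1) + match.group(2)
        if original in present:
            paired[name] = original
        else:
            orphaned.append(name)
    return paired, orphaned


def scan_folder(folder):
    """Apply pair_duplicates to the files in one (hypothetical) media folder."""
    return pair_duplicates(p.name for p in Path(folder).iterdir() if p.is_file())
```

Only the names in the first result are safe to repoint at the original; the second list holds duplicated-pattern files that have no original and need to be kept.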

Sorry if this is geeky and convoluted. Any help would be appreciated.
Researching SORRELL and SORELLE families and associated lines.
https://sorrellnotes.us

Jane
Site Admin
Posts: 8442
Joined: 01 Nov 2002 15:00
Family Historian: V7
Location: Somerset, England

Post by Jane » 14 Apr 2020 14:35

Have you looked at this plugin

https://www.family-historian.co.uk/plug ... try?id=273
Jane
My Family History : My Photography "Knowledge is knowing that a tomato is a fruit. Wisdom is not putting it in a fruit salad."

tatewise
Megastar
Posts: 27088
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Post by tatewise » 14 Apr 2020 15:02

I can see why you were not sure which Forum would be best.

I'm not sure you need to edit the RM exported GEDCOM and move the Media files.
Have you tried simply importing that GEDCOM file as a New Project?
It should automatically copy the Media files into the Project and adjust the Media record links.

Jane's Check for Possible Duplicate Media Plugin should find those pairs of files if they truly are duplicates.

If not, then the Search and Replace Plugin will find them with Lua pattern Search of
Media\%d+_[A-Z]%d+%-%d+ ?%(?%d-_?[A-Z]?%d-%-?%d-%)?%.jpg

The 1st part Media\%d+_[A-Z]%d+%-%d+ is probably similar to yours.
The 2nd part ?%(?%d-_?[A-Z]?%d-%-?%d-%)? allows all the components to be missing so matches your short names.
In the long names, the components must exist in the correct order, which is very unlikely except for your imports.
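
For readers more used to regular expressions than Lua patterns, the pattern above can be translated roughly as follows. This is a sketch under assumptions: the translation is mine (Lua's `%d-` becomes the lazy `\d*?`, `%(` becomes `\(`, and the literal backslash after `Media` has to be escaped), and the function name and sample paths are hypothetical.

```python
import re

# Approximate Python translation of the Lua pattern quoted above.
# Every component of the parenthesised part is optional, so the same
# expression matches both the short (original) and long (duplicated) names.
LOOSE = re.compile(
    r"Media\\\d+_[A-Z]\d+-\d+ ?\(?\d*?_?[A-Z]?\d*?-?\d*?\)?\.jpg"
)


def matches_media_name(path):
    """True if the path contains a short or long name in the expected shape."""
    return LOOSE.search(path) is not None


print(matches_media_name(r"Media\1234_A123456-1234.jpg"))
print(matches_media_name(r"Media\1234_A123456-1234 (1234_A123456-1234).jpg"))
print(matches_media_name(r"Media\holiday.jpg"))
```

Like the Lua original, this does not check that the text inside the parentheses repeats the first part exactly; it only requires the components to appear in the right order.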

But if you have imported the RM GEDCOM into a New Project of its own, will all the Media files form those pairs?
If so, there is no need for any searching, just merge every pair.
BTW: That could be fully automated with a fairly simple custom Plugin.

When all Media are edited and any RM UDF have been reviewed, as per the Import from RootsMagic (RM) how-to, only then would you Merge this New Project into your master Project.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by JP Ford » 14 Apr 2020 15:36

Thanks Mike. I'll give that string a go and see how it turns out.

Some thoughts as to your ideas/suggestions;

Damn! forgot about the friggin UDF's :oops:

I didn't want to import as a separate project, since I am matching my research back to a specific individual already in my database. But considering the UDF issue, that might be necessary...

Editing the gedcom is a breeze. Takes all of 90 seconds to run a regex search and replace while I am copying media files to the FH folder. A few clicks and I'm done. The import into FH is painless. I know I can "move the media files" with the importation process, but it always copies them into a separate sub-folder, which I don't want.
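
The post does not give the actual regex used for this edit, so the following is only a guess at the kind of change involved: a Python sketch that repoints every GEDCOM `FILE` line at a new folder while keeping the file name. The function name, the sample folders and the assumption that paths are Windows-style are all hypothetical.

```python
import re
from pathlib import PureWindowsPath

# A GEDCOM media path sits on a line of the form "<level> FILE <path>".
FILE_LINE = re.compile(r"^(\d+ FILE )(.+)$")


def retarget_media(gedcom_text, new_folder):
    """Point every FILE line at new_folder, keeping only the file name."""
    out = []
    for line in gedcom_text.splitlines():
        match = FILE_LINE.match(line)
        if match:
            name = PureWindowsPath(match.group(2)).name
            line = match.group(1) + str(PureWindowsPath(new_folder) / name)
        out.append(line)
    return "\n".join(out)
```

Work on a copy of the GEDCOM, as the earlier post already advises, since this rewrites every media path in the file.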

When I sort the media record by "Record" the pairs fall into place, but there are exceptions. Merging would work just as well, if I could automate the merge to select the original filename (sans the parenthetical repeats).

Okay... back to the drawing board...

Post by JP Ford » 14 Apr 2020 15:55

Jane wrote:
14 Apr 2020 14:35
Have you looked at this plugin

https://www.family-historian.co.uk/plug ... try?id=273

Thanks, Jane! I'll give it a look.

Post by tatewise » 14 Apr 2020 16:00

I strongly suggest you create a New Project from any externally generated GEDCOM.
If the automatic import of Media files goes into the wrong folder, just Move them with Windows File Explorer and then use Tools > External File Links > Auto Repair Links to mend the broken links.
Focus on fixing the UDF such as _UID, _TYPE DOCUMENT, _TYPE PHOTO, _SCBK Y, _PRIM N, etc, etc...
Merging the Media records and files is easier to manage there than in your main Project and minimises any risks.

A custom Plugin could perform most of the above in one go.

Only when you have a tidy New Project should it be File Merged into your main Project.
Otherwise, your main Project will get continually polluted with RM junk UDF, etc.

Post by tatewise » 14 Apr 2020 18:54

The Check for Possible Duplicate Media Plugin is discussed in Library Modules in WINE, Crossover, POM, POL (17629).

There are a couple of outstanding points from earlier.

You say that imported Media files when using New Project end up in an undesirable sub-folder within Media folder.
That is probably because those Media files originally exist in a subfolder of the RM GEDCOM file's folder.
It should be the folder path between those related folders that FH is recreating in the New Project.
Move that RM GEDCOM file somewhere unrelated to the Media files, then they will get imported to the Media folder.
(The reason FH creates those Media subfolders is that users often have an organised folder structure for their Media files and want that preserved when they migrate to FH, so the New Project process obliges. That is explained in the FH Help page Notes for Upgrading Users, as advised in Understanding Projects.)
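
The folder relationship described above can be pictured with a small Python sketch. It only models the behaviour as explained in this post and is not FH's actual import code; the function name and the sample paths are hypothetical.

```python
from pathlib import PureWindowsPath


def imported_subfolder(gedcom_file, media_file):
    """Subfolder path that would be recreated under Media ('' if none).

    Assumption: a subfolder is only recreated when the media file lives
    below the folder that holds the GEDCOM file.
    """
    gedcom_dir = PureWindowsPath(gedcom_file).parent
    media_dir = PureWindowsPath(media_file).parent
    try:
        relative = media_dir.relative_to(gedcom_dir)
    except ValueError:
        return ""  # unrelated folders: nothing to recreate
    return "" if str(relative) == "." else str(relative)


print(imported_subfolder(r"C:\RM\tree.ged", r"C:\RM\tree_media\a.jpg"))
print(imported_subfolder(r"C:\Export\tree.ged", r"C:\RM\tree_media\a.jpg"))
```

In the first call the GEDCOM sits above the media folder, so `tree_media` is carried over; in the second the GEDCOM has been moved somewhere unrelated and no subfolder results.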

To decide which Media record is Merged with another Media record, choose them appropriately.
In the Select Records To Be Merged dialogue, check the order of two Selected Media Records on the right.
The 1st one at the top will be the record that remains after the Merge and the 2nd one will get deleted.

Post by JP Ford » 15 Apr 2020 18:02

Okay, just an update on my (new) process of import from a temporary Ancestry tree. Having tried all the recommendations herein, my process for transferring a temporary research tree from Ancestry to FH via RM is now refined:
  1. Once research in the Ancestry tree is finished, run RM and import the tree into a new database with TreeSync.
  2. Export from RM to a GEDCOM file per the instructions at Import from RootsMagic (RM), making sure to export the GEDCOM to a subfolder of its own, separate from the RM imported media folder.
  3. Close RM and run FH. Open the previously exported GEDCOM in a new temporary FH project. Import related media from the RM folder to the project when prompted.
  4. Check media links, check and manage UDFs, review and correct Places columns, save project.
  5. Open the Primary Project, and import the temporary project per the instructions at Merge/Compare File.
  6. Run checks to confirm that the merge was successful. If so, delete temp projects.
Because I start my research with only one common individual from my primary database, there is only one primary match when I do the import. All others are new individuals, which makes the matching easier and far less error-prone.
Last edited by JP Ford on 16 Apr 2020 14:49, edited 1 time in total.

Post by tatewise » 15 Apr 2020 18:17

That is very neat and confirms that using RM TreeSync is a way of migrating Media from Ancestry to FH.

Although unlikely to need changes, don't forget to check all the tabs in the Merge File dialogue.
There may occasionally be Families or Sources or Media or Places that need attention.