* Check For Possible Duplicate Media (FH7) plugin - first prototype

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by OlivierM »

Hello Mark,

It works as expected. No more issues with accented characters etc.
I started with Reunion > 30 years ago, later TMG.
I now use FH as main software, TNG to share my data.
Transkribus to decipher old texts.
Genealogica Grafica, TCGB and My Family Tree to view & check my data. And Genopro for its layered reports.

Post by Valkrider »

Runs fine on Mac with Crossover.

Post by David2416 »

Ran well and fast, produced a list of duplicates as I had anticipated. Like the informative output. (Think the first run was slightly longer at 9 seconds.)
Screenshot 2023-12-23 110002.jpg

Post by tatewise »

The new plugin runs well but I think I have identified two problem scenarios.
  1. If there are two identical files but one has Unicode characters in the file path and the other is ANSI only, or one has greater than 250 characters and the other fewer, then they are not reported as duplicates.
    ( They are reported as duplicates by my plugin. )
    That suggests that the Windows MD5 hash is not identical to the Lua MD5 hash.
    [EDIT]
    I've run a modified version of your plugin to always use Windows MD5 hash and it finds those duplicates.
  2. If the Result Set Duplicates value is greater than 2 there may be several Media files & records involved. Some of those records may be identical and some may differ. Currently, the Status does not report those possibilities for each pair of Media records.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by Mark1834 »

  1. Good spot, thanks. It's a typo in the Windows determination, where I omitted to strip out control codes that were not visible in the debugger. Once I corrected match('%c%x+%c') to match('%c(%x+)%c'), it worked as expected.
    Where it will be an issue is with files above 32 MB, where the Unicode version gets the true MD5 hash, and the ANSI version a composite hash from individual smaller file chunks. Rather than force all large files to the same method, which is not desirable, I will just flag it as a "probable match", but not merge records and leave it to the user to sort out.
  2. I think this is essentially the same issue as you raised for the first version, and I said then I would park it until we implement merging. My preferred method at the moment is to merge records that are identical, but flag it for the user to run the plugin again to resolve any remaining pairing.
Both of these rare scenarios will be described in the future help file.
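
To illustrate the pattern fix in point 1, here is a minimal Lua sketch. It assumes the hash arrives on its own line, wrapped in control characters such as CR/LF. The sample string is invented for illustration and is not the plugin's actual command output.

```lua
-- Hypothetical command output: the 32-digit hash sits on its own line,
-- so it has control characters (CR/LF) on both sides.
local raw = 'MD5 hash of file:\r\nd41d8cd98f00b204e9800998ecf8427e\r\nDone.'

-- Without a capture, string.match returns the whole match,
-- including the invisible leading and trailing control characters.
local with_codes = raw:match('%c%x+%c')

-- With a capture, only the hex digits inside the parentheses are returned.
local hash_only = raw:match('%c(%x+)%c')

print(#with_codes, #hash_only)
```

A value that still carries stray control characters will never compare equal to a clean hash string from another source, even when the files are identical.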
Mark Draper

Post by Mark1834 »

David2416 wrote: 23 Dec 2023 11:26 Ran well and fast, produced a list of duplicates as I had anticipated. Like the informative output. (Think the first run was slightly longer at 9 seconds.)
I've noticed that as well. It can get substantially quicker with repeated runs, so there must be some automatic caching going on in the depths of the operating system.
Mark Draper

Post by peterbel »

Ran 2.2 against my data and it found one 'Records Differ', nothing else.
Inspecting the result it was exactly the same image but with two file names linked to two individuals.
However, both individuals are on the image and correctly linked to their FH records.
So a false positive, interesting. :)
Tracing the Devon Bellamy family along with their partners.

Post by jelv »

Where I had two baptisms on the same page I renamed the media record and image to include both names so I only had one.
John Elvin

Post by tatewise »

Mark1834 wrote: 23 Dec 2023 18:05
  1. Good spot, thanks. ...
    Where it will be an issue is with files above 32 MB...
  2. I think this is essentially the same issue as you raised for the first version, ...
  1. On Windows why not simplify things and use Windows MD5 hash for every file? Is it slower than Lua MD5?
    The files above 32 MB issue only applies to the Mac/Wine OS.
  2. The addition of Status is clearly a step towards merging so I thought a new comment appropriate.
    It needs to be made clear it only applies to the two records listed.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by jelv »

Would 32/64 bit OS be an issue?
John Elvin

Post by tatewise »

jelv wrote: 24 Dec 2023 11:07 Would 32/64 bit OS be an issue?
I don't think so because FH is a 32-bit application.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by Mark1834 »

Agree - it is a core design principle of W64 that it is 100% backwards-compatible with W32. The only issue I am aware of in the FH context is a badly written plugin that tries to access W64-specific features, such as WOW6432Node Registry keys, which will fail in W32. There's hardly any W32 left out in the wild, but emulators such as WINE often use W32 mode.

A more realistic constraint for this plugin is if somebody has a Windows installation that does not support the MD5 command. In this case, it would simply fall back to Lua mode, as version 0.2 does in WINE/Crossover.
Mark Draper

Post by Mark1834 »

tatewise wrote: 24 Dec 2023 10:45 On Windows why not simplify things and use Windows MD5 hash for every file? Is it slower than Lua MD5?
Mike, your experience in testing plugins and anticipating scenarios the author had not considered is always welcome, as were your suggestions when I asked if there was a better way to manage parallel plugin and OS processes in the very first post (which I have adopted with my own additional refinement as described above).

However, can I give you a gentle hint about not overstepping into micromanaging the detailed design...?

Version 0.1 used a single Windows command script to manage all the required hash values. That worked well enough on my system, but two significant flaws became apparent on wider testing. It did not work in Crossover, which is a showstopper. The whole raison d'etre of this new plugin is to add functionality to Jane's original, and that works perfectly in emulators. In addition, having a single script made it difficult to give satisfactory user feedback, as it could take anything from a second or two to more than a minute to run, depending on the number and size of files to be hashed.

Version 0.2 corrects these issues by also using Lua MD5 in parallel with Windows, and where Windows is used, creating one script per file so that progress can be monitored and fed back to the user. Therefore, using Windows MD5 for every file is not simpler - it's three separate steps; creating a script, running it, and reading the returned value. Using Lua MD5 is two steps - reading a file, and calculating the hash.

I therefore use a common Lua MD5 method where it is applicable (files below 32 MB with an ANSI-compatible path) for ease of both development and future maintenance. Only where methods need to diverge (the enhanced scope that the original plugin does not have) do I have separate processing for Windows and emulator, according to which is more appropriate in that environment.
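
The selection rule in the previous paragraph could be sketched roughly as follows. This is an illustration only, not the plugin's code: the function names, the returned labels, and the ASCII-only test for an "ANSI-compatible" path are all assumptions made for the example.

```lua
local LIMIT = 2^25  -- the 32 MB threshold mentioned in the thread

-- Simplification (assumption): a path counts as "ANSI-compatible" here only
-- if it is pure ASCII. A real check would depend on the Windows code page.
local function is_ansi_path(path)
  return path:find('[\128-\255]') == nil
end

-- Returns a label only; the hashing routines themselves are not shown.
local function choose_method(path, size, on_windows)
  if size < LIMIT and is_ansi_path(path) then
    return 'lua'        -- common method: read the file, hash it in Lua
  elseif on_windows then
    return 'windows'    -- create a script, run it, read the returned value
  else
    return 'emulator'   -- separate handling under WINE/Crossover
  end
end

print(choose_method('C:\\Media\\File.jpg', 1024, true))          -- small, ASCII path
print(choose_method('C:\\Media\\Caf\195\169.jpg', 1024, true))   -- UTF-8 "é" in path
print(choose_method('C:\\Media\\Big.tif', 2^26, false))          -- above the limit
```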

There is absolutely no problem in mixing and matching between Windows and Lua MD5 (once properly coded :)). It's a basic cryptography function that works only if it is completely independent of both the hardware and software used to generate it.
Mark Draper

Post by tatewise »

I'm sorry I provoked that response. It was a simple question.
Your Technical Notes introduced those concepts and criteria. I just wondered why. Your answer explains why.
Where was the micromanagement?
It is the season of goodwill to all. Merry Christmas and Happy New Year.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by Mark1834 »

Fair enough Mike. I think it’s worth separating out user features from technical details on how it’s done, as it’s different audiences, and that does invite comment.

Anyway, no more plugins now until after Xmas. Merry Christmas to all!
Mark Draper

Post by Mark1834 »

Version 0.3 is attached. This should be the final prototype, apart of course from any issues identified here, and has the following new features:
  • I have expanded the interim report to a full option window, as displayed below for my main project (where the Media Records are a little overdue for some housekeeping):
    [Screenshot: Check for Possible Duplicate Media (FH7).png]
  • Display Plugin Analysis presents a detailed table of potentially matching records, but does not change any project data. I have expanded it to include all potential pairs, so it is clear exactly which records will be merged, and pairs are identified as Duplicate Record (all record details match exactly), Records Differ (the same file or another file with identical content, but other record details do not match), or Duplicate File (more than one copy of the same file in a single record).
  • List Records With Multiple Files does just that - generates a list of all Media Records that contain more than one file. FH7 does not support this feature very well, and it is likely that any examples of this have been created accidentally.
  • Merge Duplicate Records merges all pairs of records where all details match exactly. These are most likely to arise either from the user accidentally creating a second record from the same file or from a flawed import or merge process. This option changes your project data, so take care if running it on your live project.
  • Plugin Help links to a temporary placeholder page in the Plugin Store, which will be expanded to a more detailed help file once the final version of the plugin is uploaded later this month.
  • Fully functional under WINE/Crossover.
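
As a rough illustration of the three statuses listed above, the pairing could be sketched in Lua like this. The table layout (`record`, `hash`, `details`) and the function names are hypothetical. `details` stands in for all record fields, including the file path, flattened to a single string. The plugin's real data structures are not shown in this thread.

```lua
-- Classify one pair of file links that already share the same content hash.
local function classify(a, b)
  if a.record == b.record then
    return 'Duplicate File'      -- same file content twice in a single record
  elseif a.details == b.details then
    return 'Duplicate Record'    -- all record details match exactly
  else
    return 'Records Differ'      -- same content, other details do not match
  end
end

-- Group entries by hash, then list every pair within each group.
local function find_pairs(entries)
  local groups, order = {}, {}
  for _, e in ipairs(entries) do
    if not groups[e.hash] then
      groups[e.hash] = {}
      order[#order + 1] = e.hash   -- remember first-seen order of hashes
    end
    table.insert(groups[e.hash], e)
  end
  local result = {}
  for _, hash in ipairs(order) do
    local g = groups[hash]
    for i = 1, #g - 1 do
      for j = i + 1, #g do
        result[#result + 1] = { g[i].record, g[j].record, classify(g[i], g[j]) }
      end
    end
  end
  return result
end

-- Invented sample data: three records sharing one file content,
-- plus one record that links the same content twice.
local entries = {
  { record = 1, hash = 'aaa', details = 'Title A|Media\\A.jpg' },
  { record = 2, hash = 'aaa', details = 'Title A|Media\\A.jpg' },
  { record = 3, hash = 'aaa', details = 'Title A|Media\\A (2).jpg' },
  { record = 4, hash = 'bbb', details = 'Title B|Media\\B.jpg' },
  { record = 4, hash = 'bbb', details = 'Title B|Media\\B.jpg' },
}

local found = find_pairs(entries)
for _, p in ipairs(found) do
  print(p[1], p[2], p[3])
end
```

Listing every pair within a group, rather than only the first two members, is what makes it clear which records would be merged when more than two share the same content.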
Technical Notes: None this time ( :)). There is nothing particularly innovative in this version, although I have endeavoured to structure the merge functions to operate as quickly and efficiently as possible.
Mark Draper

Post by tatewise »

That latest plugin version is working well against my test project of duplicate media.
However, I have some observations.

Missing Files: The plugin reports 1 but Tools > External File Links... shows 3 Media record broken file links.
However, that is due to FH not handling very long file paths whereas the plugin does handle them OK.

Records Differ status is applied even when the only difference is the file path/name.
The earlier interest in merging duplicates was due to FH importing Media resulting in multiple copies (now fixed).
In that scenario, the Media records and files were identical except that each file copy had a different suffix.
e.g. Media\File.jpg, Media\File (2).jpg, Media\File (3).jpg.
That same scenario might arise now if a user accidentally adds the same Media file more than once.
This latest plugin does not offer a way of merging those cases which stimulated interest in the original plugin.
My plugin offered to merge Media records if the only difference was the file path/name.

Duplicates Result Set width may be too wide for some screens and has many columns that could be narrower.

Multiple Files Result Set might benefit from having a RecordID column.

Running with FH V6 produces an error message for lines 67 and 329: unexpected symbol near '|'
i.e. local size = FSO:GetFile(FileName).Size|0 and if FSO:GetFile(FileName).Size|0 < 2^25 then
I suggest you replace | with or which I think is more meaningful.
Then fhInitialise(..) is invoked and says the plugin cannot be run...

Merge Duplicate Records produced the error message:
[Screenshot of the error message: image.png]
I suspect it is because it involves a Rich Text link to the Media record.
Last edited by tatewise on 03 Jan 2024 15:34, edited 3 times in total.
Reason: Added merge duplicate records error
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by Mark1834 »

Thanks, Mike - good list.

Records Differ is a deliberate design choice. If you have two media records with the same custom name and different but equivalent files, your fork discards one of the files and merges them unconditionally, which may be undesirable. The import error was fixed over a year ago, so I regard it as a legacy issue. However, I can add extra features later if necessary such as merging if it’s only a suffix difference.

Duplicates Result Set is about 1700 pixels wide, which is ok on a basic HD monitor. It’s a plugin that is only run occasionally, so it’s easy enough to change column widths if you have a smaller monitor. Ideally, it would remember any changes made after generating the list and apply them automatically next time, but I don’t think that’s possible.

V6 is an interesting one. I’ve been caught out before by the way FH7/Lua 5.3 converts numbers to strings (adds ‘.0’ for something that looks like an integer), and using |0 prevents that. Here we are only testing the value for magnitude, so it doesn’t matter whether it’s a float or the new FH7/Lua 5.3 integer sub-type and we can dispense with the |0 altogether.
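
A small Lua sketch of that point, with hypothetical function names. It avoids the `|` operator entirely, so it also parses on Lua versions that have no bitwise operators.

```lua
local LIMIT = 2^25  -- 33554432 bytes (32 MB)

-- A magnitude test gives the same answer for a float or an integer,
-- so no conversion is needed before comparing.
local function is_small(size)
  return size < LIMIT
end

-- For display, '%d' yields plain digits with no trailing '.0'.
-- It accepts a float only when its value is a whole number.
local function size_text(size)
  return string.format('%d', size)
end

print(is_small(1024.0), size_text(1024.0))
```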

Merge - good catch. I’ll fix it in the final version. Good to know that my error trapping worked exactly as intended though!
Mark Draper

Post by tatewise »

Result Set
Since this year I have had a 1920-pixel-wide monitor, but before that my monitor was only 1600 pixels wide.
However, I don't usually have FH displayed full screen width, so 1700 pixels is still rather wide.
You are correct that column width changes cannot be remembered.
Although the plugin may only be needed occasionally, it will probably have to be run several times while resolving the duplications and having to adjust the columns each time would be an annoyance, especially as several columns are rather wider than necessary.

I await the other fixes...
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by Mark1834 »

tatewise wrote: 03 Jan 2024 15:03 The earlier interest in merging duplicates was due to FH importing Media resulting in multiple copies (now fixed).
In that scenario, the Media records and files were identical except that each file copy had a different suffix.
e.g. Media\File.jpg, Media\File (2).jpg, Media\File (3).jpg.
Clarification question - in the legacy issue, do both the file and media record title gain the suffix when duplicated?

I think the most robust solution is to permit automatic merging of non-identical records only where both the file and record title differ only in their suffix, and all other fields are identical. In other words, they become identical if any suffixes are removed from both the file and title.

As described above, if the current fork is presented with a pair of records with the same custom title and different but equivalent files, it arbitrarily (to the user) chooses one of the files to keep and discards the other, which I do not think is appropriate.

In my scenario, File.jpg, File (2).jpg and File (3).jpg would all be merged, providing that the user has not changed the default record titles - if they have, we must assume it was for a good reason.
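
The suffix rule proposed above might look something like this in Lua. It is only a sketch: the `file` and `title` fields and both function names are hypothetical, and the check that all other fields and the file contents are identical is assumed to happen elsewhere.

```lua
-- Remove a trailing " (n)" from a name, keeping any file extension:
-- "File (2).jpg" -> "File.jpg", "File (2)" -> "File".
local function strip_suffix(name)
  local stem, ext = name:match('^(.-)(%.[^%.\\/]*)$')
  if not stem then
    stem, ext = name, ''   -- no extension found
  end
  stem = stem:gsub(' %(%d+%)$', '')   -- only the first gsub result is kept
  return stem .. ext
end

-- True when file and title are the same once any suffixes are removed.
local function match_ignoring_suffix(a, b)
  return strip_suffix(a.file) == strip_suffix(b.file)
     and strip_suffix(a.title) == strip_suffix(b.title)
end

local first  = { file = 'Media\\File.jpg',     title = 'File' }
local second = { file = 'Media\\File (2).jpg', title = 'File (2)' }
local custom = { file = 'Media\\File (3).jpg', title = 'Grandad at the seaside' }

print(match_ignoring_suffix(first, second))
print(match_ignoring_suffix(first, custom))
```

A record whose title the user has changed, like `custom` here, fails the test and is left alone.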
Mark Draper

Post by tatewise »

Mark1834 wrote: 04 Jan 2024 13:08 Clarification question - in the legacy issue, do both the file and media record title gain the suffix when duplicated?
As far as I can remember the Media record Title does NOT gain the suffix. The records are all identical apart from the File link name.
Mark1834 wrote: 04 Jan 2024 13:08 As described above, if the current fork is presented with a pair of records with the same custom title and different but equivalent files, it arbitrarily (to the user) chooses one of the files to keep and discards the other, which I do not think is appropriate.

In my scenario, File.jpg, File (2).jpg and File (3).jpg would all be merged, providing that the user has not changed the default record titles - if they have, we must assume it was for a good reason.
To be precise, the current fork only merges a pair of records if all the fields are identical, except possibly the File path/name, and the file contents are identical. It does not discard any Media files so the user can easily choose one of the now unlinked files or rename the linked file using the FH V7 rename file feature.

I'm not sure what you mean by "File.jpg, File (2).jpg and File (3).jpg would all be merged". They are 3 identical files.

However, I go along with your proposal to merge Media records where the only difference is the filename suffix.
i.e. All Media record fields are identical except that the File name has a suffix but its content is the same.
The filename without the suffix is the one retained.

Are you proposing to auto-delete the files with the suffix as long as they are not linked to another Media record?
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by Mark1834 »

Precise details to be determined by experiment, but that’s the general idea. If the plugin does delete the redundant file copies, it would only be after requesting specific permission to do so.
Mark Draper

Post by johnmorrisoniom »

I would like to see an option to MOVE the unlinked files to a folder of my choice

Regards
John

Post by tatewise »

John, what would be the purpose of that MOVE as they would all be identical duplicates of the retained file?
The only difference would be the filename suffix (n).
You can also run the Check for Unlinked Media plugin which has a Move Unlinked Media option, although that plugin only works on media files within the Project Media folder.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by johnmorrisoniom »

Whenever I use the Check for Unlinked Media Plugin I always use the move option, because I want to check each file separately to verify that it does need deleting (Belt and Braces approach).

The new Check for Duplicate Media functionality would do away with the need to run the unlinked media plugin.