* Check For Possible Duplicate Media (FH7) plugin - first prototype

Mark1834
Megastar
Posts: 2519
Joined: 27 Oct 2017 19:33
Family Historian: V7
Location: South Cheshire, UK

Post by Mark1834 »

A first prototype of a new Check For Possible Duplicate Media (FH7) plugin is attached. This offers the following additional features over the existing Check For Possible Duplicate Media plugin for FH7 users, and fixes issues raised in recent discussion:
  • Full support for media file names in any language, and files of any size.
  • Full support for multiple files per media record (although this GEDCOM 5.5.1 feature is not fully implemented in FH7).
  • A more detailed output report, including number of duplicates (not shown if never more than two).
  • Significantly faster run time.
This first prototype is purely read-only, and makes no changes to your project or settings, so is safe to try with any project. The option to merge duplicates (optionally including files with different names but same content) will be added later once we’ve shaken out any initial bugs. I have removed the check for missing files, as this is easy enough to do in FH7, and IMO is peripheral to the intended function.

Technical notes for those interested in the detailed plumbing: The original plugin determines the MD5 hash for all files, using Lua file handling tools that impose the language and size constraints. The new version takes a different approach, and just determines file size in its initial screen, which is quick and easy to do. Files of unique size are discounted, as they cannot have duplicates. Remaining files are passed back to Windows to determine the MD5 hash, giving faster processing and automatic memory and language management. To ensure global compatibility irrespective of local settings, temporary copies are made of any files with non-ASCII characters in their name. A slight complication is that once the Windows processing has been initiated, the plugin just carries on without waiting for it to complete. My solution is to keep checking for the expected output, but is there a better approach?
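
As a rough illustration of the size screen described above, here is a minimal sketch in plain Lua. It is not the plugin's actual code: the `files` list of `{path, size}` entries and the function name `candidatesBySize` are assumptions made for the example, and in a real plugin the sizes would come from the file system rather than a literal table.

```lua
-- Hypothetical sketch: keep only files whose size is shared with at least one other file.
-- `files` is an assumed list of tables {path = <string>, size = <integer>}.
local function candidatesBySize(files)
	local bySize = {}
	for _, f in ipairs(files) do
		local group = bySize[f.size]
		if not group then
			group = {}
			bySize[f.size] = group
		end
		group[#group + 1] = f.path
	end
	local candidates = {}
	for _, group in pairs(bySize) do
		if #group > 1 then  -- a file of unique size cannot have a duplicate
			for _, path in ipairs(group) do
				candidates[#candidates + 1] = path
			end
		end
	end
	table.sort(candidates)  -- pairs() order is unspecified, so sort for a stable result
	return candidates
end

local sample = {
	{path = 'a.jpg', size = 1024},
	{path = 'b.jpg', size = 2048},
	{path = 'c.jpg', size = 1024},
}
local result = candidatesBySize(sample)
print(table.concat(result, ', '))  --> a.jpg, c.jpg
```

Only the files returned here would need an MD5 hash, which is where the time saving comes from.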

There is no fundamental reason why this plugin couldn’t be enabled for FH5/6 as well, using plain Lua file handling, luacom/FileSystemObject directly rather than fhFileUtils(), and a simpler progress bar. However, it complicates the coding, and I’m a firm believer in keeping plugins as simple as possible to support ease of maintenance and improve the likelihood of other authors being able to support and adapt them in the future. Those who need the new merging option will be overwhelmingly FH7 users who have imported from another product, and FH5/6 users who have decided not to upgrade have managed just fine with the original for over a decade!

If the attachment has disappeared, check further down the thread for an update - ordinary users can delete attachments to their posts at any time, but cannot edit the text to explain why!
Mark Draper
Valkrider
Megastar
Posts: 1571
Joined: 04 Jun 2012 19:03
Family Historian: V7
Location: Lincolnshire

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by Valkrider »

@Mark

Unfortunately it does not work on Crossover on a Mac. It installs fine. On running it just closes FH.

Just FYI as the Mac and Crossover are not run of the mill for FH.
tatewise
Megastar
Posts: 28436
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by tatewise »

Mark1834 wrote: 16 Dec 2023 08:25 If the attachment has disappeared, check further down the thread for an update - ordinary users can delete attachments to their posts at any time, but cannot edit the text to explain why!
Presumably in the OPTIONS tab you don't have the Reason for editing this post: box.

I've just run your plugin on my duplicate media test project which has some very long (> 256) file paths and got the attached error message. My plugin specifically tests for and reports such long paths because a user posted that my plugin crashed.

[Attachment: image.png, screenshot of the error message]
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
ColeValleyGirl
Megastar
Posts: 5510
Joined: 28 Dec 2005 22:02
Family Historian: V7
Location: Cirencester, Gloucestershire

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by ColeValleyGirl »

Mike, I suggest you set up a test account; then you can check what an ordinary user can and can't see.
Mark1834
Megastar
Posts: 2519
Joined: 27 Oct 2017 19:33
Family Historian: V7
Location: South Cheshire, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by Mark1834 »

tatewise wrote: 16 Dec 2023 10:50 Presumably in the OPTIONS tab you don't have the Reason for editing this post: box.
Correct - it’s only available to Admins (I’m sure you’ve asked that question before ;) ).

I’ve checked before that even if you enable very long filenames in the Registry, FH does not support them (I believe that’s only available for 64-bit apps), so I’ll add that to the list of required mods.

For Crossover, we know that there is no fundamental issue with calling command scripts, so I wonder if it is the specific hash command that’s crashing FH? I’m sure there will be workarounds, even if we have to reduce the scope in an emulator. I’ll have a play with Linux Mint/WINE, as that’s generally less robust than Crossover.
Mark Draper
tatewise
Megastar
Posts: 28436
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by tatewise »

The technique I use in my plugins to run CMD scripts in Modal mode is:

	require "luacom"
	local luaShell = luacom.CreateObject("WScript.Shell")
	luaShell:Run('cmd.exe /C \"'..CommandFile..'\"', 0, true )
	-- Parameters: 0 = Hide or 1 = Open, true = Modal waits, or false non-Modal returns
Furthermore, I have run the Certutil -hashfile command directly on files with Unicode file paths and it produces the same MD5 hash code as the Lua md5 library, so I think you never have to copy any Media files:

>Certutil -hashfile "E:\Mike\OneDrive\Documents\Family Historian Projects V7 Bëta\Family Historian Sample Project\Family Historian Sample Project.fh_data\Media\מארלה גולל'ה.jpg" MD5 >> "E:\Mike\OneDrive\Desktop\MD5"
The only snag I found is that the output file contents seem to reduce each Unicode character to ?
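
Because the file names in the output may be mangled, one option is to ignore them and pick out only the hash lines. The following is a minimal sketch under stated assumptions, not code from either plugin: the function name `parseHashes` and the sample text are made up, and it assumes each certutil result consists of a header line, one line of hex digits (some older Windows versions separate the bytes with spaces), and a status line.

```lua
-- Hypothetical sketch: collect MD5 hashes from accumulated `certutil -hashfile` output.
-- A hash line is assumed to contain exactly 32 hex digits once whitespace is removed.
local function parseHashes(text)
	local hashes = {}
	for line in text:gmatch('[^\r\n]+') do
		local compact = line:gsub('%s', '')  -- only the first gsub result is kept
		if #compact == 32 and compact:match('^%x+$') then
			hashes[#hashes + 1] = compact:lower()
		end
	end
	return hashes
end

-- Made-up sample output; the first file name shows the '?' substitution.
local sampleOutput = table.concat({
	'MD5 hash of C:\\Media\\????.jpg:',
	'd41d8cd98f00b204e9800998ecf8427e',
	'CertUtil: -hashfile command completed successfully.',
	'MD5 hash of C:\\Media\\photo.jpg:',
	'9e 10 7d 9d 37 2b b6 82 6b d8 1d 35 42 a4 19 d6',
	'CertUtil: -hashfile command completed successfully.',
}, '\r\n')

local hashes = parseHashes(sampleOutput)
print(#hashes)  --> 2
```

If every command succeeds, the n-th hash should correspond to the n-th file written to the command script, so the names need not be read back at all. A failed command would break that pairing, so a real implementation would have to detect and handle it.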
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
Mark1834
Megastar
Posts: 2519
Joined: 27 Oct 2017 19:33
Family Historian: V7
Location: South Cheshire, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by Mark1834 »

OK, useful options to explore, thanks. I won’t be back at the desk until later (too many outdoor things to do while it’s sunny!), but I’ll have a play.
Mark Draper
johnmorrisoniom
Megastar
Posts: 904
Joined: 18 Dec 2008 07:40
Family Historian: V7
Location: Isle of Man

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by johnmorrisoniom »

I just tried it, ran ok. Can't compare how long it took to the original version as I got called away from my computer while it was running. (Original would take around 30 mins for 45,000+ media)
The result set is confusing though.
I have 4 duplicate pairs, but the duplicates shown are actually the same media file.
[Attachment: image.png]
Regards
John
Mark1834
Megastar
Posts: 2519
Joined: 27 Oct 2017 19:33
Family Historian: V7
Location: South Cheshire, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by Mark1834 »

Not only the same file, but the same Media record as well! What happens if you open the records in the Property Box and check the All tab? I think you may have two copies of the same file attached to each record, which the original plugin wouldn’t spot.
Mark Draper
peterbel
Superstar
Posts: 348
Joined: 21 Nov 2014 20:24
Family Historian: V7
Location: Cornwall

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by peterbel »

Ran it without issue and now checking the output, it was very quick.
Small typo in the PlugIn descriptor, "univeral"
Tracing the Devon Bellamy family along with their partners.
johnmorrisoniom
Megastar
Posts: 904
Joined: 18 Dec 2008 07:40
Family Historian: V7
Location: Isle of Man

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by johnmorrisoniom »

Quite correct Mark,
2 different files on each Media record, probably (certainly) from a previous duplicates merge operation where I forgot to disconnect one of the records.

Took 7 minutes to run.

Regards
John
tatewise
Megastar
Posts: 28436
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by tatewise »

Further to my earlier suggestion for handling Unicode file path characters here are some more tips.

The plugin must use UTF-8 encoding and that seems to be enforced by fhFileUtils setIupDefaults ()

The CommandFile must start with @echo off and chcp 65001 to accept Unicode characters:

	table.insert(tblC, '@echo off')
	table.insert(tblC, 'chcp 65001')
The CommandFile must be saved in 'UTF-8' format instead of 'ANSI' format:

	fhSaveTextFile(CommandFile, table.concat(tblC, '\n'):gsub('>>', '>', 1) .. '\n', 'UTF-8')
The Run command returns a status of 0 for success or a -ve value for failure:

	local status = luaShell:Run('cmd.exe /C \"'..CommandFile..'\"', 0, true )
	if status == 0 then
There is a flaw somewhere in the design when there are more than two files with the same size and hash code.
The plugin only reports one pair of the Media duplicates instead of all the pairs of duplicates.

[EDIT]
Now that Run CMD is Modal, the -hashfile output can be piped directly to the OutputFile.
Thus the copy of TempOutput to OutputFile is not needed.

Just before fhSaveTextFile insert a blank into tblC and then fhSaveTextFile does not need .. '\n'
It may not matter here, but that technique halves the Lua memory needed to compose the string.

	table.insert(tblC, '')
	fhSaveTextFile(CommandFile, table.concat(tblC, '\n'):gsub('>>', '>', 1), 'UTF-8')
Since Media files now never need be copied, the long file names are handled without any problem.
Their file size and hash code are calculated despite the long names.
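
Pulling these tips together, the command script could be composed along the following lines. This is only a sketch: the function name `buildHashScript` and the example paths are hypothetical, and it chooses `>` or `>>` per line rather than patching the first `>>` with gsub as in the snippets above.

```lua
-- Hypothetical sketch: compose a command script that hashes each file in turn.
-- The first command uses '>' to create the output file; later ones append with '>>'.
local function buildHashScript(paths, outputFile)
	local lines = {'@echo off', 'chcp 65001'}
	for i, path in ipairs(paths) do
		local redirect = (i == 1) and '>' or '>>'
		lines[#lines + 1] = string.format('certutil -hashfile "%s" MD5 %s "%s"', path, redirect, outputFile)
	end
	lines[#lines + 1] = ''  -- trailing empty entry so the script ends with a newline
	return table.concat(lines, '\n')
end

-- Example paths are made up for illustration.
local script = buildHashScript({'C:\\Media\\a.jpg', 'C:\\Media\\b.jpg'}, 'C:\\Temp\\hashes.txt')
print(script)
```

The resulting string would then be saved in UTF-8 with fhSaveTextFile and run modally as shown earlier. One caveat: a `%` character in a path has special meaning in a batch file and would need escaping, which this sketch does not attempt.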
Last edited by tatewise on 17 Dec 2023 14:52, edited 1 time in total.
Reason: Add extra [EDIT] information
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
David2416
Superstar
Posts: 398
Joined: 12 Nov 2017 16:37
Family Historian: V7
Location: Suffolk UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by David2416 »

Ran through 8234 in 32 seconds and result set showing 44 duplicates after further 8 seconds (total 40).

Most were parish register pages on which several events were recorded. I expected these, because I had saved multiple copies of the same page under different file names reflecting the different names/dates.
A couple were images linked twice to the same media record, and a few had the same media file with a different title.

Thanks Mark
Mark1834
Megastar
Posts: 2519
Joined: 27 Oct 2017 19:33
Family Historian: V7
Location: South Cheshire, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by Mark1834 »

tatewise wrote: 16 Dec 2023 16:16 There is a flaw somewhere in the design when there are more than two files with the same size and hash code.
The plugin only reports one pair of the Media duplicates instead of all the pairs of duplicates.
To use the common CP phrase, this is by design ;). What the plugin is actually tabulating is unique file hashes, not records as such. It gives full details of the first two records, but notes any additional examples in the final column (which is only shown if there is a hash with more than two matches).
[Attachment: Capture.PNG]
The original plugin reports this as two separate duplicates, which seem to be #1/#2 and #1/#3. Personally, I find it clearer to just report it once. I did have a "status" column in an early draft, similar to your fork, but left it out of the final version as it is something that will need to be considered in a lot more detail once we start adding record merging.
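
The "one row per hash" idea can be sketched as follows. This is an illustration only, not the plugin's code: the `entries` structure, the field names and the function name `duplicateRows` are assumptions, and the `total` field here counts every record sharing the hash, which may differ from what the plugin's final column actually shows.

```lua
-- Hypothetical sketch: one report row per hash shared by two or more records.
-- `entries` is an assumed list of tables {record = <string>, hash = <string>}.
local function duplicateRows(entries)
	local byHash, order = {}, {}
	for _, e in ipairs(entries) do
		local group = byHash[e.hash]
		if not group then
			group = {}
			byHash[e.hash] = group
			order[#order + 1] = e.hash  -- remember the first-seen order of hashes
		end
		group[#group + 1] = e.record
	end
	local rows = {}
	for _, hash in ipairs(order) do
		local group = byHash[hash]
		if #group > 1 then
			-- full details for the first two records, plus a count of all matches
			rows[#rows + 1] = {first = group[1], second = group[2], total = #group}
		end
	end
	return rows
end

local rows = duplicateRows({
	{record = 'O1', hash = 'aaa'},
	{record = 'O2', hash = 'bbb'},
	{record = 'O3', hash = 'aaa'},
	{record = 'O4', hash = 'aaa'},
})
for _, row in ipairs(rows) do
	print(string.format('%s / %s (%d files)', row.first, row.second, row.total))  --> O1 / O3 (3 files)
end
```

Reporting by pairs instead would turn the same three records into rows such as O1/O3 and O1/O4, which is the behaviour attributed to the original plugin above.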

Given the extra detail that the new plugin covers, I think the final version will need a basic user interface, probably just a simple iup.Alarm with Run/Help/Close buttons, with the Help button linking to the Store help.
tatewise wrote: 16 Dec 2023 16:16 The CommandFile must start with @echo off and chcp 65001 to accept Unicode characters:
Do we need advice on whether that is ok for a Store plugin? It is a settled CP position that Store plugins must not change any user setting outside the FH space (rightly so, IMO). If changing a code page via a script is purely self-contained and only changes it for that script, it is probably ok, but if it is a permanent change that has to be reversed later, it will be deemed out of scope.
Mark Draper
Mark1834
Megastar
Posts: 2519
Joined: 27 Oct 2017 19:33
Family Historian: V7
Location: South Cheshire, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by Mark1834 »

David2416 wrote: 16 Dec 2023 16:51 Ran through 8234 in 32 seconds and result set showing 44 duplicates after further 8 seconds (total 40).
Thanks David. The 32 seconds is the initial screen of file size. It knows how many records there are (it counts them as it loads, which is generally so quick the user doesn't notice the slight pause), so the progress bar can give an accurate position of where it is up to. The subsequent 8 seconds is calculating the hash values. It has no idea how long this will take, so replaces the progress bar with an indefinite whirly wheel, which seems to be current Windows preference for "please wait, but we don't know how long for" (typically a fraction of a second for each image file, but it could be 10-20 seconds or more for very large videos - I tested it with a 2GB mp4 file!).

The fundamental change for this plugin compared with earlier incarnations is that it only calculates hash values for files that might be duplicated, in this case probably around a hundred calculations rather than 8234, which is why it runs so quickly without having to store a cache of previous values.
Mark Draper
tatewise
Megastar
Posts: 28436
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by tatewise »

Mark1834 wrote: 16 Dec 2023 17:53 What the plugin is actually tabulating is unique file hashes, not records as such. It gives full details of the first two records, but notes any additional examples in the final column (which is only shown if there is a hash with more than two matches).
Sorry, I did not spot the extra Duplicates column.
When the 3 or more duplicate files have different file paths they don't always get reported.
When it comes to Media record merging it may not be clear which ones will be merged.
Mark1834 wrote: 16 Dec 2023 17:53 Do we need advice on whether that is ok for a Store plugin? It is a settled CP position that Store plugins must not change any user setting outside the FH space (rightly so, IMO). If changing a code page via a script is purely self-contained and only changes it for that script, it is probably ok, but if it is a permanent change that has to be reversed later, it will be deemed out of scope.
There is a Windows Registry setting that would have a global effect, but my suggestion only affects this one script.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
ColeValleyGirl
Megastar
Posts: 5510
Joined: 28 Dec 2005 22:02
Family Historian: V7
Location: Cirencester, Gloucestershire

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by ColeValleyGirl »

Mark1834 wrote: 16 Dec 2023 17:53
tatewise wrote: 16 Dec 2023 16:16 The CommandFile must start with @echo off and chcp 65001 to accept Unicode characters:
Do we need advice on whether that is ok for a Store plugin? It is a settled CP position that Store plugins must not change any user setting outside the FH space (rightly so, IMO). If changing a code page via a script is purely self-contained and only changes it for that script, it is probably ok, but if it is a permanent change that has to be reversed later, it will be deemed out of scope.
If a failure mid-script would leave the code page altered from the user's choice, IMO that wouldn't be acceptable.
Mark1834
Megastar
Posts: 2519
Joined: 27 Oct 2017 19:33
Family Historian: V7
Location: South Cheshire, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by Mark1834 »

tatewise wrote: 16 Dec 2023 18:08 When it comes to Media record merging it may not be clear which ones will be merged.
Exactly - that's why it will need a help page to describe merging in more detail. The more you think about it, the more options there are if we are to merge anything other than records where all fields are identical, but that's a discussion for later - let's get the basic detection fully debugged in all systems first!
Mark Draper
tatewise
Megastar
Posts: 28436
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK
Contact:

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by tatewise »

ColeValleyGirl wrote: 16 Dec 2023 18:10 If a failure mid-script would leave the code page altered from the user's choice, IMO that wouldn't be acceptable.
It doesn't. How many ways can I say my suggestion only affects this one script? When cmd scripts end by whatever mechanism all local changes are cancelled.

It is just like setting the character encoding within a plugin. However the plugin ends, there is no external effect.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
ColeValleyGirl
Megastar
Posts: 5510
Joined: 28 Dec 2005 22:02
Family Historian: V7
Location: Cirencester, Gloucestershire

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by ColeValleyGirl »

Mike, I didn't mean to imply your method was a problem. I just stated what characteristics (IMO) should be judged to determine whether something was problematic. And I have no idea if CP shares my opinion :)
Mark1834
Megastar
Posts: 2519
Joined: 27 Oct 2017 19:33
Family Historian: V7
Location: South Cheshire, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by Mark1834 »

Valkrider wrote: 16 Dec 2023 09:48 Unfortunately it does not work on Crossover on a Mac. It installs fine. On running it just closes FH.
I've traced why that happened. There is a fatal incompatibility between running under WINE and the fhFileUtils() function getFileFolderDetails(...), which I use to get file size. If I replace it with FSO:GetFile(F).Size, where FSO has previously been defined with FSO = luacom.CreateObject('Scripting.FileSystemObject'), the initial phase of excluding unique file sizes runs to completion.

Unfortunately, the Windows script for getting the hash doesn't seem to work under WINE, so I'll need to think about how we deal with that.

It's probably worth reporting that incompatibility to CP. Emulator support for FH is only "best endeavours", but if there is a workaround, I suspect they will want to introduce it. Other fhFileUtils() functions that the plugin uses work ok.

The following simple plugin will close FH if run under WINE:

fhfu = require('fhFileUtils')
file = fhGetContextInfo('CI_GEDCOM_FILE')
size = fhfu.getFileFolderDetails(file).size|0
Mark Draper
Richard_Hyland
Gold
Posts: 28
Joined: 06 Jun 2011 18:04
Family Historian: V7

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by Richard_Hyland »

Good Morning

I tried running this twice and got an error message saying

X File processing timed out

Richard
[Attachment: Error msg.jpg]
Mark1834
Megastar
Posts: 2519
Joined: 27 Oct 2017 19:33
Family Historian: V7
Location: South Cheshire, UK

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by Mark1834 »

It means that the plugin is having to wait longer than expected for the Windows processing of file hashes to complete (longer than about 75 seconds). There are a couple of likely causes - either you have a lot of potential duplicates containing more complex file names (more than just plain English letters and numbers), or your PC cannot run the script for some reason (is it up to date Windows 10, or an earlier version?).

Do either of those fit your project?
Mark Draper
victor
Superstar
Posts: 269
Joined: 08 Jan 2004 16:53
Family Historian: V7
Location: Thatcham, Berkshire, England

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by victor »

All my media are stored in separate folders under a main folder titled 'Family information'.
When I work on FH I import the appropriate folder using 'Add media', then choose 'link to existing media record' to see if the media is there. If not, then I do 'insert from file', that is, the file where I have stored my media records in the main folder.
This way I know there are no duplicates in FH, just the same media used for different people.
Victor
Richard_Hyland
Gold
Posts: 28
Joined: 06 Jun 2011 18:04
Family Historian: V7

Re: Check For Possible Duplicate Media (FH7) plugin - first prototype

Post by Richard_Hyland »

Mark

Running Windows 11 Ver 23H2.

There are around 8700 media items in 35 folders. I don't think there are any non English letters & numbers.
(There are some &)

I tried running it on a different project and got a different error.

Richard
[Attachment: Error msg2.jpg]