* Find Duplicate Individuals Version 2.3+

Post by BillH » 24 Oct 2012 01:51

Mike,

A few thoughts on version 2.4.

1.  In the Set Preferences tab, I'm a little confused.  I read the help, but still can't quite figure out the difference between the Individual Threshold on the User Interface tab and the one on the Names Matching tab.  Could you explain the difference?

2.  What happened to the Individual Minimum and the Individual Deduction values on the Names Matching tab?  Can we no longer set these?

3.  I liked the order of the values on the Names Matching tab better in version 2.3.  It had the Last Right and Last Wrong together and the Fore Right and Fore Wrong together, and all four of these were grouped together.

4.  What happened to Fore Wrong and what is Fore Other?

5.  Not a big deal, but 2.4 takes about 1 minute 55 seconds for my file of 10,005 individuals whereas 2.3 takes about 1 min 37 seconds.  So it is just a tad bit slower.  Still very acceptable though.

6.  The incrementing of the time with the percentage works great for me.  

Thanks again for a great plugin.

Bill

Post by tatewise » 24 Oct 2012 14:32

Bill, thanks for the feedback.

1.
The Threshold on the Names Matching tab is applied after the Names Assessment on each pair of family members (Individual, Father, Mother, etc), to decide if Event Assessment should proceed for that pair.
The Individual Threshold on the User Interface tab is only applied to the pair of Individuals after completing Names Assessment and Event Assessment, to decide if their Relations (Father, Mother, etc) should be assessed.

2.
If after assessing a pair of Individuals, the score has not reached the Names Assessment Threshold, then not only is Event Assessment abandoned, but the pair are eliminated from the Results.
Making their score negative will not affect that decision, unless the Names Assessment Threshold for Individuals is set to 0, which is unlikely.

3.
I suspect it is simply that you have become accustomed to the old order.
The V2.4 order is consistent with how they are used in the assessments, and how the Help & Advice explains them.
It now groups them in order of precedence, firstly for how the Name fields are assessed, and then how the resulting Score is assessed.
So the assessment progresses through the values in the order presented on the tab.
Last Wrong is an overriding (but optional) assessment applied independently of the other scoring.
It seemed inappropriate to put it at the top of the values, since it is optional.
This same logic is applied to all the Set Preferences tabs.

4.
Fore Wrong and Place Part Wrong were really misnomers, and have just been renamed using ...Other.
They mean the name matches, but other than in the correct position.
Whereas Last Wrong does mean the Lastname is wrong and mismatches.

5.
Please check that Names Matching values all match the Defaults except where you have deliberately changed them.
In particular the Threshold under Individual should be 6 not 9.

6.
Since your run times are about 100 secs or so, the 1% steps and 1 sec steps almost coincide.
The problem arises with run times of many minutes or more, but will be resolved in the next release.
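
As a rough illustration of points 1 and 2, the two thresholds can be sketched as below. This is Python rather than the plugin's Lua, and it is not the plugin's code: the function name, the `assess_events` callback and the Individual Threshold value of 20 are assumptions made up for the example. Only the Names Matching default of 6 comes from point 5.

```python
# Hypothetical sketch of the two thresholds; not the plugin's actual code.
NAMES_THRESHOLD = 6        # Names Matching tab, Individual default quoted in this thread
INDIVIDUAL_THRESHOLD = 20  # User Interface tab; value assumed purely for illustration


def assess_pair(names_score, assess_events):
    """Return (total, assess_relations), or None if the pair is eliminated.

    names_score   -- result of the Names Assessment for the pair
    assess_events -- callable returning the Event Assessment score;
                     only called if the names threshold is reached
    """
    # Stage 1: the Names Matching threshold decides whether Event
    # Assessment runs at all. Below it, the pair is dropped from the Results.
    if names_score < NAMES_THRESHOLD:
        return None
    # Stage 2: names plus events are compared with the Individual Threshold
    # to decide whether Relations (Father, Mother, etc.) should be assessed.
    total = names_score + assess_events()
    return total, total >= INDIVIDUAL_THRESHOLD
```

Note that a names score of -3 and one of 5 are treated alike here: both fall below the Names Matching threshold, so the pair is dropped either way.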
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by Jane » 24 Oct 2012 16:35

Mike, I was wondering if you had tried making all the 'child' functions for FindDuplicateRecords() local to it. As they are all called many times, making them local might help performance.
Jane
My Family History : My Photography "Knowledge is knowing that a tomato is a fruit. Wisdom is not putting it in a fruit salad."

Post by BillH » 24 Oct 2012 19:10

Mike,

Thanks for the explanations.  I think everything is fine the way you have it.

As for the timing issue, #5, I was using the same values for version 2.4 as I used for 2.3.  I had changed some values from their defaults.

Version 2.3:

Image

Version 2.4:

Image

Thanks,

Bill

Post by tatewise » 24 Oct 2012 22:41

Jane, strangely making those 'child' functions local makes very little difference.
If anything the local function version is marginally slower.

Bill, the slightly longer run time of V2.4 is possibly down to some minor changes in the Name assessments.
One ensures LastNameRight/Wrong checks are infallible now that users have greater freedom over their preferred points.
Another makes Place Name assessment ignore punctuation as well as spaces.
Although quite tiny changes, they are executed thousands of times, and it all adds up.
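
For anyone curious what ignoring punctuation as well as spaces amounts to, here is a minimal sketch in Python (the plugin itself is Lua). The helper names are hypothetical and the real assessment may well differ, for example in which characters count as punctuation.

```python
import string

# Characters dropped before comparing: ASCII punctuation plus whitespace.
# (Assumption: the plugin's own definition of punctuation may be different.)
_DROP = str.maketrans("", "", string.punctuation + string.whitespace)


def normalise_place(name):
    """Return the place name with punctuation and spaces removed."""
    return name.translate(_DROP)


def places_match(a, b):
    """Compare two place names, ignoring punctuation and spaces."""
    return normalise_place(a) == normalise_place(b)
```

With this, "Torbay, Devon" and "Torbay Devon" compare as equal, at the cost of one extra pass over each string per comparison.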

Post by BillH » 24 Oct 2012 22:45

Mike,

Sounds right. As I mentioned, this wasn't really a problem. It is still very fast for me.

Thanks,

Bill

Post by johnmorrisoniom » 26 Oct 2012 20:58

Hi Mike,
Just tried v2.4 on my file of 32014 individuals, 8691 families. I returned settings to default, apart from Last Wrong, which I set to -1 instead of 0.
Time taken was just over 35 minutes, then a gap of 2 minutes 35 seconds (I managed to time it this time) from when the progress bar went to the result set appearing.
Highest total score 43 (17I, 13F, 13M) (not a match).

Definitely much faster on a large file.

Post by tatewise » 26 Oct 2012 21:18

I would be interested to discover why it takes so long to get from Progress Bar closure to Result Set.

Could you edit the Plugin and comment out two lines near the end?
On lines 2484 and 2486, insert -- (double hyphen) comment markers at the start of the dateTimespan:Set... lines.

These add the Date Timespans to the Result Set but only appear if Enable Diagnostics Mode and Including Date Timespans are both ticked.

Then run the Plugin as before and note the time from Progress Bar closure to Result Set.

[PS EDIT]
Alternatively, run V2.5 and note the time between Progress Bar Messages at the end.
e.g.
Sorting Result Set Candidates
Adding Result Set Timespans
Composing Result Set Entries

Post by tatewise » 26 Oct 2012 22:06

The Find Duplicate Individuals Version 2.5 is available for download.

I hope this version has reduced run time slightly ~ please let me know.

It adds some progress messages to the Progress Bar indicating what Record Id it has reached, and finally what operations it performs on the Result Set.

When I get time, the Help & Advice pages for the Set Preferences tab will get added.

Post by BillH » 26 Oct 2012 22:38

Mike,

2.5 is a little faster for me. My 10,005 person file ran in 1m 33s, whereas 2.4 took about 1m 55s.

Not sure if the progress bar is working like planned or not. For me it said it was working on ID 1 for the first minute or so and then changed and said it was on ID 7931 until the window disappeared. Those were the only two ID numbers that displayed.

Bill

Post by tatewise » 27 Oct 2012 17:40

Run time sounds correct, as V2.5 should be about the same or better than V2.3.

To avoid impacting run time, the Record Id is currently only updated when the Memory Conservation limit (Set Preferences tab, User Interface tab) is reached.
So if few candidates are being discovered, the progress will update infrequently.
I might rework that, but in any case it will eventually be described in the Help & Advice.

Post by johnmorrisoniom » 27 Oct 2012 18:27

Hi Mike,
Version 2.5 does seem slightly faster (Just over 30mins for 32000+ records).
The three messages at the finish of the run were so fast that I couldn't read them. The progress bar went and 1 min 45 secs later the result set appeared.
2.5 seems to have installed alongside 2.4, so I have both at the moment.
Could this occur if one was installed via a Windows XP machine and the other on a Windows 7 machine?
The record ID seemed to update only about 5 or 6 times during the run, and I could not work out a pattern as to when it was being updated: more quickly at the beginning, with the time span an ID was shown getting longer each time (I would have expected it to be the other way round [I used to know the formula for this, but can't remember it]).

Post by Jane » 27 Oct 2012 19:21

John,

How large is the result set? Lua works in a 'Virtual Machine', so at the end all the table values are passed back to the result set window. It could be that the delay comes after Mike's code finishes, while FH picks up the data and displays it.
Jane

Post by tatewise » 27 Oct 2012 20:01

Yes, I too am interested in why it takes so long for John's  Result Set to appear.
Earlier John has said:
10/22/12 - 09:09:22   Result set is exactly 100 pairs.
10/26/12 - 20:58:54   I returned settings to default
Default Result Set size is 100 entries.

The only significant thing the Plugin does after outputting the Result Set is to save the 'sticky' data file, Results Set file, Non-Duplicates list file, and Soundex cache file.

John, how large are these files?
As I explained before, use Windows Explorer and navigate to the Plugin Data folder at:
C:\Users\{user}\Documents\Family Historian Projects\{project}\{project}.fh_data\Plugin Data
What are the sizes of:
Find Duplicate Individuals.dat
Find Duplicate Individuals.nondups
Find Duplicate Individuals.results
Find Duplicate Individuals.soundex

John, what are the names of the two Plugins?
They must be different, otherwise one would have overwritten the other.
The expected name from the WiP download would be find_duplicate_individuals.

When you run the Plugin, how long does it take for the user interface to appear?
Apart from the Result Set file, the Plugin loads the other three files at startup, before displaying the GUI.

Post by johnmorrisoniom » 27 Oct 2012 22:05

Hi Mike
.dat file is 2kb
.nondups file is 22kb
.results file is 121kb
.soundex file is 172kb

result set is 100 pairs
highest score is 37
Lowest is 33
The file name of version 2.5 has a space before the dot
[find_duplicate_individuals .fh_lua]
version 2.4 does not
[find_duplicate_individuals.fh_lua]

When I run the plugin, the interface appears straight away.

For consistency, I have run my tests on my laptop, which is an i3 W7 64 bit
I can also run it on a quad-core W7 64-bit and a quad-core Win XP machine to see if there are any major changes, but it will be Sunday before I can do that.

Post by tatewise » 28 Oct 2012 12:28

The file sizes are not exceptional.
I have Results Files double that size, although I have not seen a Soundex File that large.

The Plugin saves the Soundex File more often than necessary, and I will improve that.

I cannot explain where the space before the dot came from in the find_duplicate_individuals .fh_lua filename.
Did you download differently for V2.5 than for V2.4?

Post by johnmorrisoniom » 28 Oct 2012 12:44

Hi Mike,
I use Google Chrome on all my computers, but sometimes download on an XP machine, other times on W7.
This has happened before on version 2.0 to 2.1 and also on one of Jane's plugins.

Post by johnmorrisoniom » 28 Oct 2012 14:04

Hi Mike,
I have managed to replicate the file name occurrence.

I deleted version 2.4, then renamed version 2.5 to remove the space.

I then downloaded a new copy, and double clicked to install it. The new copy was installed alongside the original.
Looking at my downloads folder, this is what I found.

Image

When the file has been run, the number and Brackets have been correctly removed, but not the preceding space.

I seem to remember something similar happening before on a previous thread, but at that time, the part in brackets was also retained.
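
A minimal sketch of the kind of clean-up involved, in Python and purely illustrative. It assumes the downloaded copy was named in the usual browser style, e.g. `find_duplicate_individuals (1).fh_lua`, which is a guess because the screenshot is not reproduced here. It is not how FH installs plugins, and the function name is hypothetical.

```python
import re

# Matches a duplicate-download suffix such as " (1)" sitting just before the
# final extension, INCLUDING any whitespace in front of the brackets.
# Leaving out the leading \s* would remove "(1)" but keep the space,
# giving "name .fh_lua" as described above.
_DUP_SUFFIX = re.compile(r"\s*\(\d+\)(?=\.[^.]+$)")


def clean_download_name(filename):
    """Strip a trailing ' (n)' download counter from a file name."""
    return _DUP_SUFFIX.sub("", filename)
```

The point of the sketch is only that the space has to be part of the pattern being removed; otherwise the installed name differs from the existing one and both copies are kept.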

Post by tatewise » 28 Oct 2012 20:52

You are right that those symptoms have arisen before, but had supposedly been fixed.

The Find Duplicate Individuals Version 2.6 is available for download.

This may reduce run time slightly, but mainly updates the Progress Bar presentation, and adjusts the way Soundex cache files are loaded & saved.

If Saving Soundex Cache file takes a long time then it will be apparent in the Progress Bar messages.

Post by johnmorrisoniom » 29 Oct 2012 01:23

Hi Mike,
Version 2.6 took 32 Minutes with 32061 individuals. About 1min 30 secs to produce result set.
Having the record Id advancing is a definite improvement, as it shows a progression. Ids changed very fast to start with, gradually slowing to about 2 per second in the last few percent.

Post by tatewise » 29 Oct 2012 17:01

Yes, the Individual Id progression will slow down as explained below.
The 1st Id is compared with nobody, so is very quick.
The 2nd Id is compared with 1st Id, so is still quick.
The 3rd Id is compared with 1st & 2nd Id, but still quick.
The 4th Id is compared with 1st & 2nd & 3rd Id.
You get the picture...
The 1,000th Id is compared with 1st through 999th Id, so slowing down.
The 32,000th Id is compared with 1st through 31,999th Id, so quite slow.
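
The pattern above is the usual pairwise count: the n-th Id needs n - 1 comparisons, so n Individuals need n × (n - 1) / 2 in total. A small Python sketch of the arithmetic follows, assuming every Id really is compared against every earlier Id with no pairs skipped (the plugin may well do less work than this).

```python
def comparisons_for(n):
    """Comparisons made on reaching the n-th Id: it is checked against n - 1 earlier Ids."""
    return n - 1


def total_comparisons(n):
    """Total pairs for n Individuals: 0 + 1 + ... + (n - 1) = n * (n - 1) / 2."""
    return n * (n - 1) // 2


# File sizes mentioned in this thread.
for size in (4575, 10005, 32000):
    print(f"{size:>6} individuals -> {total_comparisons(size):>12,} pairs")

# Prints:
#   4575 individuals ->   10,463,025 pairs
#  10005 individuals ->   50,045,010 pairs
#  32000 individuals ->  511,984,000 pairs
```

So a file of 32,000 has roughly ten times as many pairs as one of 10,005, even though it has only about three times as many Individuals.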

If the 1 min 30 secs to produce Result Set is after Progress Bar closes, then it can only be FH that is slow.
Maybe the large size of the database (32,000+) is the problem, even with a small Result Set of 100 pairs of Individuals.

However, the Plugin Show previous Result Set of Duplicates in Family Historian for the same Result Set displays quickly.

Could it be the Plugin Lua code garbage collecting a complex table with 32,000+ entries, one per Individual, when it closes?

John, can you run the Plugin on a smaller database, of say 3,000 Individuals, just to see what happens?

Perhaps Jane or Simon have some ideas.

Post by johnmorrisoniom » 30 Oct 2012 00:14

Hi Mike,
I ran the plugin on a very small data set (288)
Everything happened so fast I didn't even get a progress bar.
Then tried a subset (4575) of my large file (basically everyone not in pool 1). Plugin ran in 34 seconds with no discernible wait to get the result set.
I can therefore only think that the wait time I am getting with the full data-set is just FH number crunching after the plugin has finished.
I have also found that although a pair may not match, it leads to more investigation that quite often does produce a match.

Post by tatewise » 30 Oct 2012 00:56

The odd thing is that earlier you said using the Plugin Show previous Result Set of Duplicates in Family Historian for the same Result Set displays quickly, despite needing the same FH number crunching.

Post by johnmorrisoniom » 30 Oct 2012 10:06

Hi Mike,
Separate problem now.
On my reduced dataset project, when I try to load previous result set I get the following error.

Code:

... Historian\Plugins\find_duplicate_individuals.fh_lua:1607: bad argument #2 to 'MoveToRecordById' (number expected, got nil)
stack traceback:
   [C]: in function 'MoveToRecordById'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:1607: in function 'strFormatResult'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:1632: in function 'doDisplayTables'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:1660: in function 'doLoadLists'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:1715: in function <... Historian\Plugins\find_duplicate_individuals.fh_lua:1710>
   (tail call): ?
   [C]: in function 'MainLoop'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:1981: in function 'GUI_MainDialogue'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:2671: in main chunk.
The whole plugin screen then 'Greys out' and the plugin has to be closed with the rhs X.

This project has only ever had version 2.6 run on it.

When I try to look at the Omit Non-Duplicates list I also get an error:

Code:

... Historian\Plugins\find_duplicate_individuals.fh_lua:1607: bad argument #2 to 'MoveToRecordById' (number expected, got nil)
stack traceback:
   [C]: in function 'MoveToRecordById'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:1607: in function 'strFormatResult'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:1632: in function 'doDisplayTables'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:1660: in function 'doLoadLists'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:1955: in function <... Historian\Plugins\find_duplicate_individuals.fh_lua:1953>
   (tail call): ?
   [C]: in function 'MainLoop'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:1981: in function 'GUI_MainDialogue'
   ... Historian\Plugins\find_duplicate_individuals.fh_lua:2671: in main chunk.

The plugin does not 'Grey Out' and navigation back to the main tab is possible, and all buttons are active.

Post by tatewise » 30 Oct 2012 13:27

That is very odd, and appears to be caused by missing Record Id data in the Non-Duplicates file.
I can only reproduce that error if I manually edit the Find Duplicate Individuals.nondups file.

What is the history of that file in the reduced dataset project?
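
For illustration only, a loader can guard against this kind of missing Record Id before any lookup is attempted. The sketch below is Python rather than the plugin's Lua, and it assumes a made-up file layout of one `id1,id2` pair per line; the real .nondups format is not described in this thread, and both function names are hypothetical.

```python
def parse_pair(line):
    """Parse 'id1,id2' into two ints.

    Returns None if either Id is missing or is not a plain number, so the
    caller never passes a missing value on to a record lookup.
    """
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2 or not all(p.isdecimal() for p in parts):
        return None
    return int(parts[0]), int(parts[1])


def load_pairs(lines):
    """Return (valid_pairs, skipped_count) for an iterable of text lines."""
    pairs, skipped = [], 0
    for line in lines:
        pair = parse_pair(line)
        if pair is None:
            skipped += 1  # malformed entry: count it rather than fail later
        else:
            pairs.append(pair)
    return pairs, skipped
```

Skipping and counting bad entries would turn the traceback above into a warning, though it would not explain how the entry came to be malformed in the first place.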