Find Duplicate Individuals Version 2.3+
Posted: 24 Oct 2012 01:51
by BillH
Mike,
A few thoughts on version 2.4.
1. In the Set Preferences tab, I'm a little confused. I read the help, but still can't quite figure out the difference between the Individual Threshold on the User Interface tab and the one on the Names Matching tab. Could you explain the difference?
2. What happened to the Individual Minimum and the Individual Deduction values on the Names Matching tab? Can we no longer set these?
3. I liked the order of the values on the Names Matching tab better in version 2.3. It had the Last Right and Last Wrong together and the Fore Right and Fore Wrong together, and all four of these were grouped together.
4. What happened to Fore Wrong and what is Fore Other?
5. Not a big deal, but 2.4 takes about 1 minute 55 seconds for my file of 10,005 individuals whereas 2.3 takes about 1 min 37 seconds. So it is just a tad bit slower. Still very acceptable though.
6. The incrementing of the time with the percentage works great for me.
Thanks again for a great plugin.
Bill
Posted: 24 Oct 2012 14:32
by tatewise
Bill, thanks for the feedback.
1.
The Threshold on the Names Matching tab is applied after the Names Assessment on each pair of family members (Individual, Father, Mother, etc), to decide if Event Assessment should proceed for that pair.
The Individual Threshold on the User Interface tab is only applied to the pair of Individuals after completing Names Assessment and Event Assessment, to decide if their Relations (Father, Mother, etc) should be assessed.
2.
If after assessing a pair of Individuals, the score has not reached the Names Assessment Threshold, then not only is Event Assessment abandoned, but the pair are eliminated from the Results.
Making their score negative will not affect that decision, unless the Names Assessment Threshold for Individuals is set to 0, which is unlikely.
3.
I suspect it is simply that you have become accustomed to the old order.
The V2.4 order is consistent with how they are used in the assessments, and how the Help & Advice explains them.
It now groups them in order of precedence, firstly for how the Name fields are assessed, and then how the resulting Score is assessed.
So the assessment progresses through the values in the order presented on the tab.
Last Wrong is an overriding (but optional) assessment applied independently of the other scoring.
It seemed inappropriate to put it at the top of the values, since it is optional.
This same logic is applied to all the Set Preferences tabs.
4.
Fore Wrong and Place Part Wrong were really misnomers, and have just been renamed using ...Other.
They mean the name matches, but other than in the correct position.
Whereas Last Wrong does mean the Lastname is wrong and mismatches.
5.
Please check that Names Matching values all match the Defaults except where you have deliberately changed them.
In particular the Threshold under Individual should be 6 not 9.
6.
Since your run times are about 100 secs or so, the 1% steps and 1 sec steps almost coincide.
The problem arises with run times of many minutes or more, but will be resolved in the next release.
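The staged use of the two thresholds described in points 1 and 2 can be sketched roughly as follows. This is a Python illustration only, not the Plugin's Lua code. The function name, the way scores are combined, and the User Interface threshold value of 10 are all assumptions; only the Names Matching default of 6 comes from point 5.

```python
# Hypothetical sketch of the staged assessment; not the Plugin's actual code.
NAMES_THRESHOLD = 6        # Names Matching tab, Threshold under Individual (default per point 5)
INDIVIDUAL_THRESHOLD = 10  # User Interface tab, Individual Threshold (made-up value)

def assess_pair(names_score, events_score, relations_score):
    """Return the total score for a pair of Individuals, or None if eliminated."""
    # Stage 1: Names Assessment. If the score has not reached the threshold,
    # Event Assessment is abandoned and the pair is dropped from the Results.
    if names_score < NAMES_THRESHOLD:
        return None
    # Stage 2: Event Assessment (assumed here to simply add to the names score).
    total = names_score + events_score
    # Stage 3: Relations (Father, Mother, etc.) are only assessed if the
    # Individual's score reaches the User Interface threshold.
    if total >= INDIVIDUAL_THRESHOLD:
        total += relations_score
    return total
```

In the real Plugin each relation pair presumably goes through its own Names Assessment as well; that is left out here to keep the sketch short.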
Posted: 24 Oct 2012 16:35
by Jane
Mike, I was wondering if you had tried making all the 'child' functions for FindDuplicateRecords() local to it? As they are all called many times, making them local might help performance.
Posted: 24 Oct 2012 19:10
by BillH
Mike,
Thanks for the explanations. I think everything is fine the way you have it.
As for the timing issue, #5, I was using the same values for version 2.4 as I used for 2.3. I had changed some values from their defaults.
Version 2.3:
Version 2.4:
Thanks,
Bill
Posted: 24 Oct 2012 22:41
by tatewise
Jane, strangely making those 'child' functions local makes very little difference.
If anything the local function version is marginally slower.
Bill, the slightly longer run time of V2.4 is possibly down to some minor changes in the Name assessments.
One ensures LastNameRight/Wrong checks are infallible now that users have greater freedom over their preferred points.
Another makes Place Name assessment ignore punctuation as well as spaces.
Although quite tiny changes, they are executed thousands of times, and it all adds up.
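As a rough illustration of the Place Name change, comparing names with spaces and punctuation ignored could look like the sketch below. It is written in Python rather than the Plugin's Lua, and the function names are hypothetical; the Plugin's real matching rules are not shown in this thread.

```python
import string

# Characters to ignore when comparing place names.
IGNORED = set(string.punctuation) | set(string.whitespace)

def normalise_place(place):
    """Drop spaces and punctuation from a place name before comparing."""
    return "".join(ch for ch in place if ch not in IGNORED)

def places_match(a, b):
    """True if two place names are equal once spaces and punctuation are ignored."""
    return normalise_place(a) == normalise_place(b)
```

With this, "St. Helens, Lancashire" and "St Helens Lancashire" compare as equal. Even a small step like this runs once per comparison, which is why it adds up over thousands of pairs.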
Posted: 24 Oct 2012 22:45
by BillH
Mike,
Sounds right. As I mentioned, this wasn't really a problem. It is still very fast for me.
Thanks,
Bill
Posted: 26 Oct 2012 20:58
by johnmorrisoniom
Hi Mike,
Just tried v2.4 on my file of 32014 individuals, 8691 Families. I returned settings to default, apart from last wrong, which I set to -1 instead of 0.
Time taken was just over 35 Minutes, then a gap of 2 minutes thirty five (I managed to time it this time) from when progress bar went, to result set appearing.
Highest total score 43 (17I, 13F, 13M) (not a match).
Definitely much faster on a large file.
Posted: 26 Oct 2012 21:18
by tatewise
I would be interested to discover why it takes so long to get from Progress Bar closure to Result Set.
Could you edit the Plugin and comment out two lines near the end?
On lines 2484 and 2486, insert -- (double hyphen) comment markers at the start of the dateTimespan:Set... lines.
These add the Date Timespans to the Result Set but only appear if Enable Diagnostics Mode and Including Date Timespans are both ticked.
Then run the Plugin as before and note the time from Progress Bar closure to Result Set.
[PS EDIT]
Alternatively, run V2.5 and note the time between Progress Bar Messages at the end.
e.g.
Sorting Result Set Candidates
Adding Result Set Timespans
Composing Result Set Entries
Posted: 26 Oct 2012 22:06
by tatewise
The Find Duplicate Individuals Version 2.5 is available for download.
I hope this version has reduced run time slightly ~ please let me know.
It adds some progress messages to the Progress Bar indicating what Record Id it has reached, and finally what operations it performs on the Result Set.
When I get time, the Help & Advice pages for the Set Preferences tab will get added.
Posted: 26 Oct 2012 22:38
by BillH
Mike,
2.5 is a little faster for me. My 10,005 person file ran in 1m 33s, whereas 2.4 took about 1m 55s.
Not sure if the progress bar is working like planned or not. For me it said it was working on ID 1 for the first minute or so and then changed and said it was on ID 7931 until the window disappeared. Those were the only two ID numbers that displayed.
Bill
Posted: 27 Oct 2012 17:40
by tatewise
Run time sounds correct, as V2.5 should be about the same or better than V2.3.
To avoid impacting run time, the Record Id is currently only updated when the Memory Conservation limit (set on the User Interface tab of the Set Preferences tab) is reached.
So if few candidates are being discovered, the progress will update infrequently.
I might rework that, but in any case it will eventually be described in the Help & Advice.
Posted: 27 Oct 2012 18:27
by johnmorrisoniom
Hi Mike,
Version 2.5 does seem slightly faster (Just over 30mins for 32000+ records).
The three messages at the finish of the run were so fast that I couldn't read them. The progress bar went, and 1 min 45 secs later the result set appeared.
2.5 seems to have installed alongside 2.4, so I have both at the moment.
Could this occur if one was installed via a windows XP machine and the other on a windows 7 machine?
The record ID seemed to update only about 5 or 6 times during the run, and I could not work out a pattern as to when it was being updated: more quickly at the beginning, with the time span an ID was shown getting longer each time (I would have expected it to be the other way round [I used to know the formula for this, but can't remember it]).
Posted: 27 Oct 2012 19:21
by Jane
John,
How large is the result set? Lua works in a 'Virtual Machine', so at the end all the table values are passed back to the result set window. It could be that the delay comes after Mike's code finishes, while FH picks up the data and displays it.
Posted: 27 Oct 2012 20:01
by tatewise
Yes, I too am interested in why it takes so long for John's Result Set to appear.
Earlier John has said:
10/22/12 - 09:09:22 Result set is exactly 100 pairs.
10/26/12 - 20:58:54 I returned settings to default
Default Result Set size is 100 entries.
The only significant thing the Plugin does after outputting the
Result Set is to save the 'sticky' data file, Results Set file, Non-Duplicates list file, and Soundex cache file.
John, how large are these files?
As I explained before, use Windows Explorer and navigate to the Plugin Data folder at:
C:\Users\{user}\Documents\Family Historian Projects\{project}\{project}.fh_data\Plugin Data
What are the sizes of:
Find Duplicate Individuals.dat
Find Duplicate Individuals.nondups
Find Duplicate Individuals.results
Find Duplicate Individuals.soundex
John, what are the names of the two Plugins?
They must be different, otherwise one would have overwritten the other.
The expected name from the WiP download would be find_duplicate_individuals.
When you run the Plugin, how long does it take for the user interface to appear?
Apart from the Result Set file, the Plugin loads the other three files at startup, before displaying the GUI.
Posted: 27 Oct 2012 22:05
by johnmorrisoniom
Hi Mike
.dat file is 2kb
.nondups file is 22kb
.results file is 121kb
.soundex file is 172kb
result set is 100 pairs
highest score is 37
Lowest is 33
The file name of version 2.5 has a space before the dot
[find_duplicate_individuals .fh_lua]
version 2.4 does not
[find_duplicate_individuals.fh_lua]
When I run the plugin, the interface appears straight away.
For consistency, I have run my tests on my laptop, which is an i3 W7 64 bit
I can also run it on a quad core W7 64-bit and a quad core Win XP to see if there are any major changes, but it will be Sunday before I can do that.
Posted: 28 Oct 2012 12:28
by tatewise
The file sizes are not exceptional.
I have Results Files double that size, although I have not seen a Soundex File that large.
The Plugin saves the Soundex File more often than necessary, and I will improve that.
I cannot explain where the space before the dot came from in the find_duplicate_individuals .fh_lua filename.
Did you download differently for V2.5 than for V2.4?
Posted: 28 Oct 2012 12:44
by johnmorrisoniom
Hi Mike,
I use google chrome on all my computers, but sometimes download on an XP machine, other times on W7.
This has happened before on version 2.0 to 2.1 and also on one of Jane's plugins.
Posted: 28 Oct 2012 14:04
by johnmorrisoniom
Hi Mike,
I have managed to replicate the file name occurrence.
I deleted version 2.4, then renamed version 2.5 to remove the space.
I then downloaded a new copy, and double clicked to install it. The new copy was installed alongside the original.
Looking at my downloads folder, this is what I found.
When the file has been run, the number and Brackets have been correctly removed, but not the preceding space.
I seem to remember something similar happening before on a previous thread, but at that time the part in brackets was also retained.
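The symptom can be reproduced with a small Python sketch. It assumes the second download was saved as "find_duplicate_individuals (1).fh_lua", which is how browsers commonly name a repeat download (the screenshot is not in the text). How the installer really cleans the name is not known; both functions below are hypothetical.

```python
import re

def strip_copy_suffix_buggy(filename):
    # Removes the "(1)" marker but leaves the space that preceded it.
    return re.sub(r"\(\d+\)", "", filename)

def strip_copy_suffix(filename):
    # Removes any whitespace together with a "(n)" marker that sits
    # immediately before the final file extension.
    return re.sub(r"\s*\(\d+\)(?=\.[^.]+$)", "", filename)

downloaded = "find_duplicate_individuals (1).fh_lua"  # assumed download name
print(repr(strip_copy_suffix_buggy(downloaded)))
print(repr(strip_copy_suffix(downloaded)))
```

The first call gives 'find_duplicate_individuals .fh_lua', with the stray space before the dot, so it counts as a different file name and installs alongside the original. The second gives 'find_duplicate_individuals.fh_lua'.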
Posted: 28 Oct 2012 20:52
by tatewise
You are right that those symptoms have arisen before, but had supposedly been fixed.
The Find Duplicate Individuals Version 2.6 is available for download.
This may reduce run time slightly, but mainly updates the Progress Bar presentation, and adjusts the way Soundex cache files are loaded & saved.
If Saving Soundex Cache file takes a long time then it will be apparent in the Progress Bar messages.
Posted: 29 Oct 2012 01:23
by johnmorrisoniom
Hi Mike,
Version 2.6 took 32 minutes with 32061 individuals. About 1 min 30 secs to produce the result set.
Having the Record Id advancing is a definite improvement, as it shows a progression. Ids changed very fast to start with, gradually slowing to about 2 per second in the last few percent.
Posted: 29 Oct 2012 17:01
by tatewise
Yes, the Individual Id progression will slow down as explained below.
The 1st Id is compared with nobody, so is very quick.
The 2nd Id is compared with 1st Id, so is still quick.
The 3rd Id is compared with 1st & 2nd Id, but still quick.
The 4th Id is compared with 1st & 2nd & 3rd Id.
You get the picture...
The 1,000th Id is compared with 1st through 999th Id, so slowing down.
The 32,000th Id is compared with 1st through 31,999th Id, so quite slow.
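The pattern above can be put into numbers with a short Python sketch. It assumes every Id really is compared with every earlier Id, with no shortcuts, which may not be exactly what the Plugin does; the function names are made up for illustration.

```python
def comparisons_for(n):
    """Number of earlier Ids that the n-th Id is compared with."""
    return n - 1

def total_comparisons(n):
    """Total pairs compared for n Individuals: 0 + 1 + ... + (n - 1)."""
    return n * (n - 1) // 2

for size in (4, 1000, 10005, 32000):
    print(size, comparisons_for(size), total_comparisons(size))
```

Under that assumption the total grows with roughly the square of the file size: 10,005 Individuals need 50,045,010 comparisons, whereas 32,000 need 511,984,000, about ten times as many for a file about three times the size.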
If the 1 min 30 secs to produce Result Set is after Progress Bar closes, then it can only be FH that is slow.
Maybe the large size of the database (32,000+) is the problem, even with a small Result Set of 100 pairs of Individuals.
However, the Plugin Show previous Result Set of Duplicates in Family Historian for the same Result Set displays quickly.
Could it be the Plugin Lua code garbage collecting a complex table with 32000+ entries, one per Individual, when it closes?
John, can you run the Plugin on a smaller database, of say 3,000 Individuals, just to see what happens.
Perhaps Jane or Simon have some ideas.
Posted: 30 Oct 2012 00:14
by johnmorrisoniom
Hi Mike,
I ran the plugin on a very small data set (288)
Everything happened so fast I didn't even get a progress bar.
Then I tried a subset (4575) of my large file (basically everyone not in pool 1). The Plugin ran in 34 seconds with no discernible wait to get the result set.
I can therefore only think that the wait time I am getting with the full data set is just FH number crunching after the plugin has finished.
I have also found that although a pair may not match, it leads to more investigation that quite often does produce a match.
Posted: 30 Oct 2012 00:56
by tatewise
The odd thing is that earlier you said using the Plugin Show previous Result Set of Duplicates in Family Historian for the same Result Set displays quickly, despite needing the same FH number crunching.
Posted: 30 Oct 2012 10:06
by johnmorrisoniom
Hi Mike,
Separate problem now.
On my reduced dataset project, when I try to load previous result set I get the following error.
Code:
... Historian\Plugins\find_duplicate_individuals.fh_lua:1607: bad argument #2 to 'MoveToRecordById' (number expected, got nil)
stack traceback:
[C]: in function 'MoveToRecordById'
... Historian\Plugins\find_duplicate_individuals.fh_lua:1607: in function 'strFormatResult'
... Historian\Plugins\find_duplicate_individuals.fh_lua:1632: in function 'doDisplayTables'
... Historian\Plugins\find_duplicate_individuals.fh_lua:1660: in function 'doLoadLists'
... Historian\Plugins\find_duplicate_individuals.fh_lua:1715: in function <... Historian\Plugins\find_duplicate_individuals.fh_lua:1710>
(tail call): ?
[C]: in function 'MainLoop'
... Historian\Plugins\find_duplicate_individuals.fh_lua:1981: in function 'GUI_MainDialogue'
... Historian\Plugins\find_duplicate_individuals.fh_lua:2671: in main chunk.
The whole plugin screen then 'Greys out' and the plugin has to be closed with the rhs X.
This project has only ever had version 2.6 run on it.
When I try to look at the Omit Non-Duplicates list I also get an error:
Code:
... Historian\Plugins\find_duplicate_individuals.fh_lua:1607: bad argument #2 to 'MoveToRecordById' (number expected, got nil)
stack traceback:
[C]: in function 'MoveToRecordById'
... Historian\Plugins\find_duplicate_individuals.fh_lua:1607: in function 'strFormatResult'
... Historian\Plugins\find_duplicate_individuals.fh_lua:1632: in function 'doDisplayTables'
... Historian\Plugins\find_duplicate_individuals.fh_lua:1660: in function 'doLoadLists'
... Historian\Plugins\find_duplicate_individuals.fh_lua:1955: in function <... Historian\Plugins\find_duplicate_individuals.fh_lua:1953>
(tail call): ?
[C]: in function 'MainLoop'
... Historian\Plugins\find_duplicate_individuals.fh_lua:1981: in function 'GUI_MainDialogue'
... Historian\Plugins\find_duplicate_individuals.fh_lua:2671: in main chunk.
The plugin does not 'Grey Out' and navigation back to the main tab is possible, and all button are active.
Posted: 30 Oct 2012 13:27
by tatewise
That is very odd, and appears to be caused by missing Record Id data in the Non-Duplicates file.
I can only reproduce that error if I manually edit the Find Duplicate Individuals.nondups file.
What is the history of that file in the reduced dataset project?
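Whatever the cause, the error message shows a missing Id reaching MoveToRecordById as its second argument. A defensive load that sets aside incomplete entries might look roughly like the Python sketch below. The "id,id" line layout is purely hypothetical, since the real format of the .nondups file is not described in this thread, and the function names are made up.

```python
def parse_record_id(text):
    """Return the Record Id as an int, or None if the text is not a valid Id."""
    text = text.strip()
    return int(text) if text.isdecimal() else None

def load_pairs(lines):
    """Split lines into usable Id pairs and a list of rejected lines."""
    pairs, rejected = [], []
    for line in lines:
        ids = [parse_record_id(field) for field in line.split(",")]
        if len(ids) == 2 and None not in ids:
            pairs.append((ids[0], ids[1]))
        else:
            # A missing or non-numeric Id: keep the line for reporting
            # instead of passing an empty value on to a record lookup.
            rejected.append(line)
    return pairs, rejected
```

For example, load_pairs(["12,34", "56,", "8,9"]) keeps the pairs (12, 34) and (8, 9) and rejects "56,".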