* Find Duplicate Individuals

tatewise
Megastar
Posts: 27088
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Post by tatewise » 28 Jun 2012 22:09

Jane said:
On the Soundex I notice you use a Global for the look up table, my understanding is local variable look ups are much quicker than global ones, so it might be interesting to try setting a local variable from the global one.
That was my understanding, but I experimented with both a Global and Local table, and a Global table was faster.
I guess the Global table only has to be created once, whereas the Local table has to be re-created every time the Soundex function is called.
I will also try your idea of setting a Local table to point at the Global one.
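
The plugin's own Lua code is not shown in this thread, so the following is only a Python sketch of the idea being discussed: a lookup table built once at module level, with the function holding a local reference to it. It uses the standard American Soundex rules; the names `SOUNDEX_CODES` and `soundex` are hypothetical, and nothing here says how the plugin actually does it or how the timings compare in Lua.

```python
# Illustrative only: not the plugin's code (which is Lua and not shown in this thread).
# The table is built once, when the module is loaded, not on every call.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}


def soundex(name, codes=SOUNDEX_CODES):
    """Return the 4-character American Soundex code for a name.

    The default argument binds the module-level table to a local name once,
    which is the 'local pointing at the global' idea from the post above.
    """
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]
```

For example, `soundex("Robert")` and `soundex("Rupert")` both give `R163`, which is why Soundex is useful for matching variant spellings of a name.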

However, I have had a breakthrough: I have not only slightly sped up the Soundex function, but also revised the Soundex Name matching, which makes the Plugin about 7 times faster and corrects a previously undetected fault.

Bill, Chris, Colin, Jane ~ Thanks for all your feedback. I am digesting the ideas and hope to include many of them in the next version. See my earlier proposals, and remember many of the current high scorers will get demoted once family relationships are included in the checks.

Post by tatewise » 29 Jun 2012 00:47

The Find Duplicate Individuals Version 1.2 is now available for download.

This has significant run time improvements, and adds better Event Place matching and Family Relationship matching.
See the Work in Progress Notes for more details.

BillH
Megastar
Posts: 2184
Joined: 31 May 2010 03:40
Family Historian: V7
Location: Washington State, USA

Post by BillH » 29 Jun 2012 01:07

Mike,

The runtime improvements are very impressive. My runtime went from almost 10 minutes to about 40 seconds for a database of 9000 individuals. Very nice!!

The new rules have helped a lot. However, I'm still seeing a lot of non-duplicates at the top of the report. Some of these are individuals that are related (parent/child or siblings). As I mentioned, I think the name similarities are getting so many points that the -3 for being related isn't overcoming them. Should relatives like this be reported at all? If so, can the amount deducted for being related be higher than -3 points (maybe -20 or something)? Because of all these non-duplicates, a person that I know is a duplicate is not showing on the report because they don't have enough points.
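
The arithmetic behind this point can be sketched in Python as follows. Only the -3 and -20 penalties come from the thread; the match points (20 and 16), the pair labels, and the function names are made up for illustration, and the plugin's real scoring rules are not shown here.

```python
def score_pair(match_points, is_close_relative, relative_penalty=-3):
    """Total score for a candidate pair: match points plus any relative penalty."""
    return match_points + (relative_penalty if is_close_relative else 0)


def rank(pairs, relative_penalty=-3):
    """Return the pair labels ordered by descending score."""
    scored = [
        (score_pair(points, related, relative_penalty), label)
        for label, points, related in pairs
    ]
    return [label for _, label in sorted(scored, key=lambda item: item[0], reverse=True)]


# Hypothetical points: a father and son sharing a name score high on name matching.
pairs = [
    ("father/son, same name", 20, True),
    ("true duplicate", 16, False),
]

print(rank(pairs))                        # father/son still ranks first with -3
print(rank(pairs, relative_penalty=-20))  # true duplicate ranks first with -20
```

With these made-up numbers, a -3 penalty leaves the related pair on 17 points against 16 for the true duplicate, whereas -20 drops it to 0.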

Thanks for all the effort. Are you really up after midnight working on this? [smile]

Thanks,

Bill

Valkrider
Megastar
Posts: 1534
Joined: 04 Jun 2012 19:03
Family Historian: V7
Location: Lincolnshire

Post by Valkrider » 29 Jun 2012 08:02

Mike,

Tremendous speed improvement: my 19 second search now takes less than 2 seconds.

I agree that the -3 isn't sufficient, as I have several fathers and sons, and mothers and daughters, showing up because they have the same name. Maybe check the RINs and, if they are within say 10 of each other, take another 3 off the score; it may help.

Cambiz
Famous
Posts: 235
Joined: 26 Sep 2003 23:30
Family Historian: None

Post by Cambiz » 29 Jun 2012 08:54

My 2 hour run has come down to 15 minutes.

In an earlier post you mention that the output will only be for the top 12 matches. I can see a potential problem, especially in a large, maybe ONS, ged, in that after a few runs with corrections, the 12 untrue matches may constantly score higher than true ones. I seem to remember, but (in the 5 minutes I'm allowing myself before I get on with some work) can't find, a statement about only comparing new records, which would mean that does not happen.

If it did, and the top 12 settle to the same records, maybe you could build in a flag check with an exclude clause for the flag.

Post by Cambiz » 29 Jun 2012 08:56

... and if I had looked at the output for the last run I would have seen 38, not 12 records, all over score 20.

I still like the flag idea.

Post by Valkrider » 29 Jun 2012 09:11

To back up my suggestion about the RIN check, if you look at my latest report you will see that most fall into the same category.

Image

For me flags would also be very useful.

Post by tatewise » 29 Jun 2012 10:24

Thank you for your continued feedback, which is extremely useful.

I am pleased with the speed improvement, which suggests even large databases can now be thoroughly checked in a reasonable time.
Nevertheless, I may still add some options to check just a subset of Individuals in a later version.

Meanwhile, I am still tinkering with the points system, and adding extra family relationship checks.
Your suggestions with these features are most helpful.

johnmorrisoniom
Megastar
Posts: 882
Joined: 18 Dec 2008 07:40
Family Historian: V7
Location: Isle of Man

Post by johnmorrisoniom » 29 Jun 2012 15:38

Hi Mike,
Very impressive speed increase:
my 29000+ individuals took less than 25 minutes this time around.

Is there, or can there be, a way to save the result set? And also maybe to choose how many matches appear in the result set?

Keep up the good work.

RogerF
Famous
Posts: 182
Joined: 26 Apr 2009 16:32
Family Historian: V6.2
Location: Oxfordshire, England

Post by RogerF » 29 Jun 2012 16:10

Sorry, didn't get a chance to check 1.1, but 1.2 ran through my 12000 individuals in under two minutes. The results look interesting, clearly worthy of further study. Many thanks!

Something that surprised me about the result set -- presumably a FH feature, not down to you, Mike -- was that although the results were sorted on descending score, I couldn't reproduce that sort order once I'd clicked on any of the other column headings; all sorts were then ascending.

PeterR
Megastar
Posts: 1129
Joined: 10 Jul 2006 16:55
Family Historian: V7
Location: Northumberland, UK

Post by PeterR » 29 Jun 2012 16:24

John,
Are you wanting to save the result set by a method other than those provided in FH for saving the results of any query or plugin result set? (Save Result Set As... from the Query Menu.)

Roger,
I can sort any column in descending order by holding down Alt while clicking the column heading.

Post by RogerF » 29 Jun 2012 16:28

Many thanks, Peter. D'you know, I was expecting a right-click Ascending/Descending option, so very familiar from the Records window...

Post by PeterR » 29 Jun 2012 16:40

Mike,
I've noticed that the selection of two adjacent records in the result set, by holding down Shift, only works properly if the result set has not been sorted manually. I wonder if this is a strange bug in FH, since any individual record can be selected OK, whatever sorting has been done.

TimTreeby
Famous
Posts: 168
Joined: 12 Sep 2003 14:56
Family Historian: V6.2
Location: Ogwell, Devon

Post by TimTreeby » 29 Jun 2012 17:21

Yes, a definite improvement in speed: 10,500 individuals down from 20 mins to 1 min 45 s.
And matching definitely seems to have been improved as well. This is one plugin which I didn't think I would need, but it is definitely worth using, as I have found duplicates even though I was sure that I had none.

Post by johnmorrisoniom » 29 Jun 2012 19:40

Just ran 1.2 on my W7 machine.
Just over 14 mins for 29,816 individuals.

I think there does need to be some method of marking pairs that have already been checked and found not to be duplicates, as I'm starting to get a static top 12.

Post by tatewise » 29 Jun 2012 20:07

PeterR said:
I've noticed that the selection of two adjacent records in the result set, by holding down Shift, only works properly if the result set has not been sorted manually. I wonder if this is a strange bug in FH, since any individual record can be selected OK, whatever sorting has been done.
This appears to be a strange bug in FH, and contrary to what you say, I find it affects single record selections too.
It only afflicts Result Sets created by a Plugin, not those created with a Query.

If you hold Alt and click the Score header to restore the original order, then the selection becomes correct.
I will report this bug in the V5 Usage thread.

Post by Valkrider » 30 Jun 2012 19:11

Mike

Another little anomaly in the screen shot below. The latest run has identified husband and wife as duplicates.

Image

Post by tatewise » 01 Jul 2012 15:43

The Find Duplicate Individuals Version 1.3 is now available for download.

This version adds a user interface to allow subsets to be selected, both by selecting specific Records, and by setting a last Updated date threshold.
In addition, any Records in a 'Non-Duplicates' Named List will be excluded from matching each other.
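
Read literally, that rule means a pair is skipped only when both Records are in the Named List. A minimal Python sketch of that reading follows; the function name `candidate_pairs` and the use of plain record ids are assumptions for illustration, not the plugin's actual (Lua) code.

```python
from itertools import combinations


def candidate_pairs(record_ids, non_duplicates):
    """Yield every pair of records to compare, skipping pairs where BOTH
    records are in the Non-Duplicates list (hypothetical reading of the rule)."""
    excluded = set(non_duplicates)
    for a, b in combinations(record_ids, 2):
        if a in excluded and b in excluded:
            continue  # already reviewed and judged not to be duplicates
        yield (a, b)


# Records 1 and 2 are in the list, so they are never compared with each other,
# but each is still compared with record 3.
print(list(candidate_pairs([1, 2, 3], non_duplicates=[1, 2])))
```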

Checks for Father and Mother names have been added, and the points scoring adjusted in response to several of your comments.
Please, let me know if there are still any glaring anomalies.

The Result Set now lists the points allocated to each sub-category, to aid diagnosing where the scoring may need further adjusting.

LornaCraig
Megastar
Posts: 2996
Joined: 11 Jan 2005 17:36
Family Historian: V7
Location: Oxfordshire, UK

Post by LornaCraig » 01 Jul 2012 16:47

Hello Mike,
I'm just having a look at the latest version. I note that in the info you say 'At present only the first instance of Birth, Baptism/Christening, Marriage, and Death events are assessed.' Two issues come to mind:

First, does it take account of a burial date, where no death date is known? At present it is suggesting a match (11 points) between an individual born in 1924 and two other individuals each buried in 1771 (where no death date has been entered for the latter two individuals).

Second, does it use the existence of a baptism date to assume a birth date no later than baptism? At present it is suggesting a match between the individual born in 1924 (for whom lots of dates are known, but not a baptism date) and an individual baptised in 1739.

Keep up the good work!

Post by Valkrider » 01 Jul 2012 17:17

Mike,

Just installed 1.3 and am now getting this error message.

Image

[cry]

Post by Cambiz » 01 Jul 2012 18:21

Image

Mike,

I'm testing with a subset gained through a query. It actually took 9:45.

The description page tells me that it will still check all the records against the subset, which is why, I guess, the estimate vs the actual time differ so.

Post by Cambiz » 01 Jul 2012 18:34

Image

Mike,

Something strange. I ran the plugin, created a list, Non-Duplicates, did not merge or edit any records.

I then started to run the plugin again, and used the 'Include Selected Subset of the Individuals' option (I've used the Has Flag query).

The first run has selected 4198 records, the second 4197.

Running the query outside of the plugin gives 4198 records.

Going back into the plugin and it still says 4197.

EDIT: I ran it again and created a Named List from within the query. That Named List has 4198 records, so I'm afraid the plugin's count may be awry.

Post by tatewise » 01 Jul 2012 18:36

The Find Duplicate Individuals Version 1.4 is now available for download.

Sorry about the faults.

I believe this Version fixes the line 506 'compare number with nil' error, which I think is caused by Individual Records with no Update date.

Chris ~ It also corrects a minor error in the run time estimation formula.

As Lorna suggested, it includes Burial Event data, if there is no Death Event, and checks the chronological order of Event Dates.

Post by tatewise » 01 Jul 2012 18:44

Chris ~ Is one of your Has Flag Records in the Non-Duplicates Named List?

The count of Records is an amalgam of your selected Has Flag Records, minus any with an Updated date earlier than the chosen Date, and minus any in the Non-Duplicates Named List.
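
As a rough Python sketch of that count (not the plugin's Lua code): the function name, the use of record ids, and the decision to keep Records that have no Updated date are all assumptions made for illustration.

```python
from datetime import date


def count_records_to_check(selected, updated_dates, threshold, non_duplicates):
    """Count selected records, minus those updated before the threshold date,
    minus those in the Non-Duplicates list.

    Assumption: a record with no Updated date is kept rather than dropped.
    """
    excluded = set(non_duplicates)
    count = 0
    for rec_id in selected:
        if rec_id in excluded:
            continue
        updated = updated_dates.get(rec_id)  # None when there is no Updated date
        if updated is not None and updated < threshold:
            continue
        count += 1
    return count


selected = [1, 2, 3, 4]
updated_dates = {1: date(2012, 6, 30), 2: date(2011, 1, 1), 3: date(2012, 7, 1)}
print(count_records_to_check(selected, updated_dates, date(2012, 1, 1), {3}))
```

Under these assumptions, a single selected Record that is also in the Non-Duplicates list would account for a count that is one lower than the query's own total, such as 4197 against 4198.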

Post by LornaCraig » 01 Jul 2012 18:51

tatewise said:
As Lorna suggested, it includes Burial Event data, if there is no Death Event, and checks the chronological order of Event Dates.
Thanks, 1.4 has got rid of the matches between the individual born in 1924 and the ones buried in 1771.
However, it is still suggesting a match (with a score of 11) between the individual born in 1924 and one baptised in 1739, so the chronology test is not quite right.
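
One way such a chronology test could work is sketched below in Python. This is an assumption about the approach, not the plugin's actual logic: baptism stands in for a missing birth, burial stands in for a missing death, and two individuals are ruled out when their possible lifetimes cannot overlap. The names and the 110-year `MAX_LIFESPAN` are hypothetical.

```python
MAX_LIFESPAN = 110  # hypothetical upper bound on a lifetime, in years


def first_known(events, *keys):
    """Return the year of the first listed event that is recorded, else None."""
    for key in keys:
        if events.get(key) is not None:
            return events[key]
    return None


def life_span(events):
    """Estimate (earliest, latest) years alive from the known event years."""
    start = first_known(events, "birth", "baptism")  # baptism bounds birth
    end = first_known(events, "death", "burial")     # burial bounds death
    if start is None and end is None:
        return None
    if start is None:
        start = end - MAX_LIFESPAN
    if end is None:
        end = start + MAX_LIFESPAN
    return start, end


def could_be_same_person(a, b):
    """False only when the two estimated lifetimes cannot overlap."""
    span_a, span_b = life_span(a), life_span(b)
    if span_a is None or span_b is None:
        return True  # no dates at all, so nothing can be ruled out
    return span_a[0] <= span_b[1] and span_b[0] <= span_a[1]


born_1924 = {"birth": 1924, "death": 1990}
print(could_be_same_person(born_1924, {"baptism": 1739}))  # False
print(could_be_same_person(born_1924, {"burial": 1771}))   # False
```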
