* Find Duplicate Individuals
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Find Duplicate Individuals
There is now a prototype Work in Progress plugin that may help identify duplicated Individual Records.
See Find Duplicate Individuals.
Currently it only checks single Individual Facts, but I have plans to add checks for Parents and Children.
Comments on its effectiveness would be welcome.
ID:6333
- RogerF
- Famous
- Posts: 182
- Joined: 26 Apr 2009 16:32
- Family Historian: V6.2
- Location: Oxfordshire, England
- Contact:
Find Duplicate Individuals
It got about 30% of the way through my 12000 individuals before quitting with 'Not enough memory'.
It would certainly be good to include optional checks on, for example, whether two apparently similar individuals have different parents, or whether they have conflicting census entries for the same year, and so on.
I'd find this a very useful plugin
- BillH
- Megastar
- Posts: 2179
- Joined: 31 May 2010 03:40
- Family Historian: V7
- Location: Washington State, USA
Find Duplicate Individuals
Mike,
I got all the way through my 9000 individuals, but it took about 9 1/2 minutes. While it was running my system really slowed down and on other windows I had open the scroll bar quit working and I was unable to use alt-tab to switch between windows. I have a 3.4 GHz quad core processor with 8 GB of memory. As soon as the plugin ended, everything returned to normal.
As for the results, it listed 9 individuals. None were actually duplicates and most really were not even close to being duplicates. The names were not close and the birth dates were not close. Several were siblings, but had different birth dates.
The scores ranged from 17 to 24. How high do the numbers go? Is 24 considered a high score?
Bill
Find Duplicate Individuals
Perhaps it would be better here

Find Duplicate Individuals
I narrowed it down to this record. I guess name 2 is keeling the plugin over.
Find Duplicate Individuals
Hit the 'Not enough memory' after ~15 minutes on the full file.
- johnmorrisoniom
- Megastar
- Posts: 882
- Joined: 18 Dec 2008 07:40
- Family Historian: V7
- Location: Isle of Man
Find Duplicate Individuals
I got the same error [invalid pattern capture] at about 15% of the file (3 mins).
My file has 29,000+ Individuals.
I use square brackets to denote unknown names. Could this be linked to ChrisM's error? I noticed that the name he had problems with used square brackets.
No problems with the computer slowing down though, even on my laptop (an i3 with 4gb ram W7 64bit)
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Find Duplicate Individuals
Thank you for the rapid feedback.
Roger/Chris/John ~ I suspected large databases might cause problems. I had tested it on a 2,000 entry database, which took about 1 minute to complete. I will look into possible ways to improve matters with large data.
Bill ~ Because the Plugin is looping, comparing Individuals, it is very busy, but I will look at ways to allow other tasks to operate at the same time.
Could you list some of the top pairs of Individuals, with all their Names, and full Birth/Marriage/Death Dates so I can investigate.
2 points are allocated for each matching Name, 1 point for matching Name Soundex, 2 points for exactly matching Dates, 1 point for overlapping Date Periods / Date Ranges / Quarter Dates, and 1 point if Place Soundex matches.
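The point scheme above could be sketched roughly as follows. This is only an illustration in Python, not the plugin's actual code: the `Fact` type and the `score_names` and `score_fact` functions are hypothetical, Soundex codes are assumed to be pre-computed, dates are simplified to inclusive ranges, and the points are assumed to be additive (an exact match also earns the Soundex or overlap point).

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical types for illustration only; not the plugin's data model.
DateRange = Tuple[date, date]   # inclusive (earliest, latest)
Name = Tuple[str, str]          # (name text, pre-computed Soundex code)


@dataclass(frozen=True)
class Fact:
    when: Optional[DateRange] = None
    place_soundex: Optional[str] = None


def score_names(names_a: List[Name], names_b: List[Name]) -> int:
    """2 points per exactly matching name plus 1 point per matching Soundex code."""
    score = 0
    for text_a, code_a in names_a:
        for text_b, code_b in names_b:
            if text_a.lower() == text_b.lower():
                score += 2
            if code_a == code_b:
                score += 1
    return score


def score_fact(a: Fact, b: Fact) -> int:
    """2 points for identical dates, 1 for overlapping dates, 1 for matching place Soundex."""
    score = 0
    if a.when and b.when:
        if a.when == b.when:
            score += 2
        # Two inclusive ranges overlap when each starts before the other ends.
        if a.when[0] <= b.when[1] and b.when[0] <= a.when[1]:
            score += 1
    if a.place_soundex and a.place_soundex == b.place_soundex:
        score += 1
    return score
```

Under these assumptions, two Individuals who share many Alternate Names accumulate name points quickly, because every pair of names is compared.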
Chris ~ Yes, the ? would do it. Will be fixed in next version.
- BillH
- Megastar
- Posts: 2179
- Joined: 31 May 2010 03:40
- Family Historian: V7
- Location: Washington State, USA
Find Duplicate Individuals
Mike,
Here is what the report looks like.

Here is the name and date info for these individuals.
Johan Sjursen Gaard
Other Names:
Johan Siurson
Johan Sjurson
Joseph S Guard
Joseph S Gaard
Joseph Gard
Birth Date: 14 Dec 1856
Marriage Date: 14 Dec 1880
Death Date: 9 May 1920
Sivert Sjursen Gaard
Other Names:
Syvert //
Severt Gaard
Seivert Gaard
Sivert Siurson
Severt S Gaard
Birth Date: 14 Dec 1856
Marriage Date: Mar 1883
Death Date: 2 Feb 1936
Brite Olsdatter Føleide
Other Names:
Anna //
Brithe Foeleide
Brite Felide
Brette Felide
Brita Hauge
Breita Hauge
Bretta Houge
Garita Hauge
Brita Olsdatter
Britha Hauge
Breta Hauge
Birth Date: 24 Aug 1829
Marriage Date: 19 May 1856
Death Date: 8 Aug 1900
Berta Martinusdatter Føleide
Other Names:
Bertha Hauge
Birth Date: 23 Feb 1856
Marriage Date: 19 Nov 1877
Death Date:
Brita Jonsdatter Hauge
Other Names:
Britha Hauge
Birte Fure
Bertha Fure
Bertha J Hauge
Bertha Hange
Berta Fure
Bertha Haage
Bertha Houge
Bertie Fure
Bertha Hauge
Berta Jonsdatter
Besey Hauge
Birth Date: 22 Aug 1858
Marriage Date: 1879 (app)
Death Date: 30 Jul 1927
Laurits Sjursen Gaard
Other Names:
Laurits Sjurson
Lewis Gaard
Tauritz Gaard
Louis S Gaard
Lourits Siurson
Lauritz S Gard
Birth Date: 6 Jun 1859
Marriage Date: 9 Mar 1885
Death Date: 26 Nov 1930
Jonas Sjursen Gaard
Other Names:
Jon or Joe //
Jonas Siurson
Jonas Sivertsen Gaard
John S Gaard
John S Sivertsen
John Sivertsen
John Sivertson
John Severson
Birth Date: 17 Jun 1848
Marriage Date: bef 1875
Death Date: 27 Jan 1895
John Henry Driftmier
Other Names:
Johannes L Driftmeier
Joseph Henry Driftmier
Birth Date: 14 Jun 1832
Marriage Date: 1855 (app)
Death Date: 11 Nov 1871
Johann Heinrich Driftmier
Other Names:
Joseph Henry Driftmier
H Driftmier
J H Driftmier
Joseph William Driftmeier
J F Driftmier
Birth Date: 11 Nov 1861
Marriage Date: 23 Feb 1882
Death Date: 19 Jan 1937
Brite Andersson Hauge
Other Names:
Brithe Andersdatter
Birth Date: 1765
Marriage Date: 1812
Death Date:
Anne Martinusdatter Føleide
Other Names:
Anne Hauge
Birth Date: 23 Dec 1858
Marriage Date: 19 Nov 1877
Death Date: 2 Nov 1932
Brite Noasdatter Berge
Other Names:
Brithe N Fallide
Birth Date:
Marriage Date: 4 Jul 1826
Death Date:
There are a lot of variations of Brithe along with the surnames Gaard and Hauge. That may confuse things.
However, most of these have far different birth dates which I would think would make them unlikely candidates for being duplicates.
For example, while both Driftmiers have an alternative name of Joseph Henry Driftmier, their birth dates are 29 years apart (and in fact they are father and son).
Hope this helps. Let me know if I can supply anything else.
Thanks,
Bill
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Find Duplicate Individuals
Thanks for creating this report.
It works fine on my set-up. I didn't notice the problems that some others have maybe because I have more memory in my machine.
Obviously it is a work in progress, but it doesn't seem to be doing a full Soundex search. The names I am researching are Lefever, Lefevre and other similar variants, and it doesn't seem to pick these variants up, even though I know that I have some duplicates with alternatively spelt surnames.
Another thing that would be really useful would be if the report included the RIN, so you could print it out and do a match and merge easily, unless of course I have missed something with the plugin and it does M & M more easily.
Thanks once again for producing this plugin.
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Find Duplicate Individuals
Colin ~ It is my understanding that Lefever and Lefevre both Soundex code to L116.
But Soundex code matches alone only gain 1 point per match, so unless there are other matching Forenames, Nicknames, etc., or matching BMD Dates, the score will not be very high.
Could you give some examples of Duplicates so I can investigate further, with full Name details and BMD Dates.
See the Find Duplicate Individuals notes for details of point scoring.
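The Soundex claim above can be checked with a small sketch of the standard American Soundex rules. This is plain Python written for illustration, not the plugin's own Soundex function; the name `soundex` and the handling of non-ASCII letters (they are simply dropped) are assumptions.

```python
# Letter-to-digit table for standard American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or '' if there are no A-Z letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (code + "000")[:4]


print(soundex("Lefever"))  # L116
print(soundex("Lefevre"))  # L116
```

Both spellings reduce to L, then F (1), V (1, kept because a vowel separates it from F) and R (6), so a Soundex-only match between them is expected.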
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Find Duplicate Individuals
Tatewise
Thanks for responding so promptly.
To give you an example the plugin identifies 14 records as duplicates. Running the same gedcom file through Ancestral Quest 14 identifies some 2088 duplicates. Family Tree Maker 2012 identifies 38 duplicates. This is on a file with 1650 individuals in it.
Of the three, FTM initially seems to be the one that is probably correct number-wise.
It is clear that duplicates cause some issues for all genealogy programmes.
I will spend some time this afternoon sorting my duplicates out and give you some specific data examples.
It would certainly be useful for me if your plugin gave some of the options that AQ does for the duplicates finder, maybe you will consider this for a later version. (Screenshot attached)


- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Find Duplicate Individuals
I have sorted out some duplicates this afternoon as promised.
This one is probably a better example
This is what Ancestral Quest reported

This is what The Duplicates query reported

This is the first duplicate record

This is the second duplicate

If you need the gedcom or anything else please let me know.
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Find Duplicate Individuals
The Find Duplicate Individuals Version 1.1 is now available for download.
I think this has fixed all the above reported faults.
Bill ~ I have entered Johan Sjursen Gaard and Sivert Sjursen Gaard name and date details into my project, but only get a Score of 15 points.
Are there any other Alternate Name Prefix/Suffix/Nickname/Given Name details you have not mentioned?
Perhaps post a copy of the Names/Titles window obtained by clicking more (+)... blue link on Main tab of Property Box.
Because your data illustrates multiple Alternate Names, I am considering placing a limit on the points available for Name matches.
Colin ~ Currently the Result Set only lists the top dozen or so Duplicate Individual Candidates on the assumption that there would only be a handful of genuine duplicates.
The example you give for Emma TYERS would currently score about 9 points, and is probably only a few entries off the bottom of your Result Set.
Thank you for the Ancestral Quest options, something like that may appear in a later version.
All ~ See the WiP Notes for advice on comparing pairs of Individuals in the Result Set.
Note that the Result Set is only a list of Candidates, especially since at present I am only comparing Individual Data and not Family Relations.
Once the teething problems are resolved, I plan to add checks for Spouses, Parents, Children, etc.
This should avoid some of the more obviously unlikely candidates due to such relationships.
- BillH
- Megastar
- Posts: 2179
- Joined: 31 May 2010 03:40
- Family Historian: V7
- Location: Washington State, USA
Find Duplicate Individuals
Mike,
Here is the info you requested.


Even 15 seems like a high score. Is it because their 'middle' names are both Sjursen? This is a problem for all of my ancestors from Norway. I put their patronymic in as their middle name and their farm name as their surname. This is a common way to handle Scandinavian patronymic names.
I wasn't sure if you included baptism and confirmation dates. These two were twins and they were baptized and confirmed on the same date as well.
Some of the other examples are of people that aren't even in the same family.
Thanks,
Bill
- johnmorrisoniom
- Megastar
- Posts: 882
- Joined: 18 Dec 2008 07:40
- Family Historian: V7
- Location: Isle of Man
Find Duplicate Individuals
Hi Mike,
Tried V1.1 this morning.
Took just under 3 hrs to process 29000+ Individuals on a 2.4G Quad core running XP SP3 3GB Ram (But I was doing lots of other stuff at the same time), but no errors this time.
Results were:
'17' '28039' 'Philip Christopher Teare' '24039' 'Philip Christopher Teare'
'15' '24423' 'John James Killip' '24421' 'Thomas William Killip'
'15' '16054' 'Margaret Ann Fayle' '9964' 'Margaret Ann Fayle'
'15' '16325' 'Thomas Watterson' '16324' 'Edward Watterson'
'14' '13273' 'Maryon Scott Pilcher' '13272' 'Eunice Scott Pilcher'
'14' '9409' 'Laura Evans' '9317' 'Laura Evans'
'13' '13151' 'Martha Elizabeth Bridgford' '3953' 'Martha Elizabeth Bridgford'
'13' '15361' 'Gertrude Tudor' '12474' 'Beatrice Tudor'
'13' '25053' 'Emily May Teare' '10582' 'Emily May Teare'
'13' '1169' 'John Joseph Kenna' '510' 'John Joseph Kenna'
'13' '22020' 'Mary Elizabeth Carver' '18867' 'Mary Elizabeth Carver'
'13' '7988' 'John Magor Cardell' '7985' 'George Magor Cardell'
Some are obviously not matches, but there are a few that I need to check out.
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Find Duplicate Individuals
Bill ~ The matches are:
Surnames GAARD & SIURSON and Forenames Sjursen & S giving a score of 4 x 3 = 12.
Born same Date & Place scores 4.
Christened same Date & Place scores 4.
Married in same Place scored 1 in V1.0.
Died in same Place scored 1 in V1.0.
This totals 22 (not sure where other 2 points come from).
But hey Bill, they are twins, with so much in common is it surprising they are Candidate Duplicates?
I give the Plugin a gold star for matching them at this stage of WiP!
Since the faults appear to be fixed, I plan the following changes in the next Version:
Limit the Name match score to 10 points to avoid overwhelming the result when many Alternate Names match.
If the number of Children matches then add 1 point.
If the Event Dates differ, then don't match Event Places.
If the Event Dates both exist but differ, then deduct 1 point.
If there is a Gender mismatch then deduct 3 points.
If the Individuals are closely related then deduct 3 points.
Include checks for Parent and Spouse name matches.
Maybe look at possible performance speed up for big databases.
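As a rough illustration of how the planned adjustments might combine, here is a minimal Python sketch. It is not the plugin's code: the function `adjusted_score` and all of its parameters are hypothetical, and it assumes the name and event points have already been totalled separately.

```python
def adjusted_score(name_points: int, event_points: int, *,
                   same_child_count: bool = False,
                   differing_dates: int = 0,
                   gender_mismatch: bool = False,
                   closely_related: bool = False) -> int:
    """Apply the planned cap, bonus and deductions to a raw score."""
    score = min(name_points, 10) + event_points   # name matches capped at 10
    if same_child_count:
        score += 1                                # same number of Children
    score -= differing_dates                      # 1 per event whose dates both exist but differ
    if gender_mismatch:
        score -= 3
    if closely_related:
        score -= 3
    return score


# Example: 12 name points and 8 event points for a pair of close relatives.
print(adjusted_score(12, 8, closely_related=True))  # 15
```

The rule about not matching Event Places when the Event Dates differ is assumed to be applied earlier, when `event_points` is computed.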
Find Duplicate Individuals
...'Maybe look at possible performance speed up for big databases.'
Yes please
33000 individuals

- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Find Duplicate Individuals
John & Chris ~ Chris's 33K Individuals in 2 hours, stands up well against John's 29K in 3 hours.
If there are X Individuals, the number of comparisons needed is X squared divided by 2.
Which works out at about 40K to 50K comparisons per second!
So don't hold your breath for any dramatic improvements I'm afraid, and make sure you capture the Result Set or use Merge/Compare Records as per the WiP Notes.
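The arithmetic can be sketched in a few lines of Python. The exact number of distinct pairs among X Individuals is X(X-1)/2, which for large X is very close to X squared divided by 2. The function names below are hypothetical, and the run times quoted in this thread are approximate, so any rate derived from them is only a rough estimate.

```python
def pair_count(n: int) -> int:
    """Number of distinct pairs among n Individuals: n(n-1)/2."""
    return n * (n - 1) // 2


def comparisons_per_second(n: int, seconds: float) -> float:
    """Average comparison rate if all pairs are checked in the given time."""
    return pair_count(n) / seconds


# Pair counts for the database sizes mentioned in this thread.
for size in (2_000, 9_000, 29_000, 33_000):
    print(size, pair_count(size))

# Estimated rate for 29,000 Individuals processed in 3 hours.
print(round(comparisons_per_second(29_000, 3 * 60 * 60)))
```

Because the count grows with the square of the database size, doubling the number of Individuals roughly quadruples the run time.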
- Jane
- Site Admin
- Posts: 8440
- Joined: 01 Nov 2002 15:00
- Family Historian: V7
- Location: Somerset, England
- Contact:
Find Duplicate Individuals
Mike, your plugin is looking good. For people with large files, I wonder if it might be worth allowing a subset of records to be worked on: select them in the normal record selector and compare only those, rather than processing the whole file?
Find Duplicate Individuals
I did go into the plugin later and hack it to remove the soundex statements. From the rate it was going, before I stopped it, I estimated that it would take 20 minutes to complete.
It did find four real duplicates including one person who had three individual records.
The highest score was 15
I even worked out how to Branch Match after all these years!
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Find Duplicate Individuals
Jane ~ I had thought of something like that, but wanted to get the full search scenario working first.
Even if the user selects a Subset, they still have to be compared against every Individual, and eventually the sum of the Subsets must include every Individual.
So the total time would actually double, but could be tackled piecemeal.
A strategy I am considering is that the user runs the Plugin once on the entire database, to eliminate any existing duplicates.
Then on subsequent runs only the Subset that have been updated by checking =LastUpdated(%INDI%) would need to be checked against all Individuals.
Chris ~ Glad the Plugin is helping you, even though it took 2 hours.
[EDIT]
Thanks for heads up on Soundex ~ I have made the function more efficient and reduced runtime a bit.
Beware that the Progress Bar is deceptive because the Progress is non-linear.
My 2,000 database reaches 50% in 11 seconds but takes nearly 50 seconds to complete.
This is because the 1st Individual is not compared at all.
The 2nd Individual is compared with 1st.
The 3rd Individual is compared with 1st & 2nd.
The 4th Individual is compared with 1st, 2nd & 3rd.
...and so on, each Individual being compared with all its predecessors.
I might change it so Progress is linear, and thus more predictable.
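The loop structure described above explains the uneven Progress Bar: after the first k of n Individuals have been processed, only k(k-1)/2 of the n(n-1)/2 comparisons are done. A minimal Python sketch of a progress value based on comparisons rather than Individuals (the function name is hypothetical, and this is not the plugin's code):

```python
def fraction_of_work_done(k: int, n: int) -> float:
    """Share of all pairwise comparisons completed after k of n Individuals."""
    if n < 2:
        return 1.0  # nothing to compare
    return (k * (k - 1)) / (n * (n - 1))


# Halfway through the Individuals is only about a quarter of the work.
print(round(fraction_of_work_done(1_000, 2_000), 3))  # 0.25
```

That is broadly consistent with the timings above, where the halfway point arrives after 11 of roughly 50 seconds.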
- Jane
- Site Admin
- Posts: 8440
- Joined: 01 Nov 2002 15:00
- Family Historian: V7
- Location: Somerset, England
- Contact:
Find Duplicate Individuals
I was actually thinking you would just compare within the Surname. So, for example, I would create a query to pick out all the people with a variant of Scadden and let the plugin just compare within that list. Not so useful for ONS databases, but for data like mine, which contains many names, it would be a bit quicker.
On the Soundex, I notice you use a global for the lookup table. My understanding is that local variable lookups are much quicker than global ones, so it might be interesting to try setting a local variable from the global one.
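The idea of comparing only within a surname group can be sketched as follows. This is a hypothetical Python illustration rather than anything the plugin does: `candidate_pairs` and the `key` function are made-up names, and in practice the key might be a surname Soundex code or the result of a query.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(individuals, key):
    """Yield pairs of individuals that share the same grouping key."""
    groups = defaultdict(list)
    for person in individuals:
        groups[key(person)].append(person)
    for members in groups.values():
        # Only compare people inside the same group.
        yield from combinations(members, 2)


people = [("I1", "Scadden"), ("I2", "Scaddan"), ("I3", "Hauge"), ("I4", "Scadden")]
pairs = list(candidate_pairs(people, key=lambda p: p[1][:4].lower()))
print(len(pairs))  # 3 pairs instead of the 6 needed to compare everyone
```

The trade-off is that a duplicate whose surname falls into a different group is never compared, so the choice of key matters.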
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Find Duplicate Individuals
Mike
Another couple that it didn't identify

and

Thanks for adding the record numbers to the new version. BTW 1650 records search on a quad core 64bit machine with 16gb of ram took 19 seconds to run.
- BillH
- Megastar
- Posts: 2179
- Joined: 31 May 2010 03:40
- Family Historian: V7
- Location: Washington State, USA
Find Duplicate Individuals
Mike,
Yes, I think the plugin is doing a great job. Thanks for programming it and making it available to us.
With so many things in common, Johan and Sivert would be good candidates except for the fact that they are siblings. Could there be an option to exclude someone as a candidate if they are a parent to, a sibling of, or child of the other person?
Could there be an option to exclude middle names? As I mentioned, this causes problems for all my Scandinavian ancestors.
Some of the candidates on my list are obviously not really candidates to be duplicates because they were born over 100 years apart. Even so, they are getting a score of 16 or higher. Could there be an option to exclude candidates if their birth year is more than a specified number of years apart from the other person?
I think the number of points being awarded for matching names is making them look like duplicates when they really aren't. If the dates are all wrong then it really shouldn't matter how closely the names match should it?
I had a definite duplicate that wasn't showing up in the list. I think this is because the name and birth date matched, but that was it. The person had no alternate names and no other dates. Since the point score was so low, this person wasn't showing up in the results and a lot of non-duplicates were showing up. Could there be an option to show more individuals in the list, or show the list for everyone with more than a specified number of points?
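To make the last two requests concrete, here is a minimal Python sketch of the kind of filter I have in mind. It is purely illustrative: `keep_candidate`, `min_score` and `max_year_gap` are made-up names and default values, not existing plugin options.

```python
from typing import Optional


def keep_candidate(score: int,
                   birth_year_a: Optional[int],
                   birth_year_b: Optional[int],
                   *, min_score: int = 10, max_year_gap: int = 10) -> bool:
    """Keep a pair only if it scores enough and the birth years are close."""
    if score < min_score:
        return False
    if birth_year_a is not None and birth_year_b is not None:
        if abs(birth_year_a - birth_year_b) > max_year_gap:
            return False
    return True  # unknown birth years are not used to exclude a pair


# The two Driftmiers, born 29 years apart, would be dropped despite a high score.
print(keep_candidate(16, 1832, 1861))  # False
```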
Thanks
Bill