* Find Duplicate Individuals

Homeless Posts from the old forum system

Post by LornaCraig » 01 Jul 2012 18:51

"As Lorna suggested, it includes Burial Event data, if there is no Death Event, and checks the chronological order of Event Dates."
Thanks, 1.4 has got rid of the matches between the individual born in 1924 and the ones buried in 1771.
However, it is still suggesting a match (with a score of 11) between the individual born in 1924 and one baptised in 1739, so the chronology test is not quite right.
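
For illustration only, here is a minimal Python sketch of one possible chronology test. It is not the Plugin's actual logic: the function name, the event dictionaries, and the 110-year lifespan limit are all assumptions made for this example. The idea is that two records cannot be the same person if their combined event years cannot fit inside one lifetime.

```python
from typing import Dict

MAX_LIFESPAN_YEARS = 110  # assumed upper bound, not a Plugin setting


def plausibly_same_person(events_a: Dict[str, int], events_b: Dict[str, int]) -> bool:
    """Return False when the combined event years cannot fit one lifetime."""
    years = list(events_a.values()) + list(events_b.values())
    if not years:
        return True  # no dates at all, so nothing contradicts a match
    return max(years) - min(years) <= MAX_LIFESPAN_YEARS


born_1924 = {"birth": 1924}
baptised_1739 = {"baptism": 1739}
buried_1771 = {"burial": 1771}

print(plausibly_same_person(born_1924, baptised_1739))    # False
print(plausibly_same_person(born_1924, buried_1771))      # False
print(plausibly_same_person(baptised_1739, buried_1771))  # True
```

A fuller test would presumably also check the order of events (birth before death, and so on), which this sketch leaves out.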

Post by Cambiz » 01 Jul 2012 19:24

Mike

a) Estimate was 4:30, actual 10:30, 4198 records

b) Sorry, but I'm still one adrift.
No of individuals = 33098
After creation of Non-Duplicates = 32941
Non-Duplicates = 157 - verified from Named List (there are a few multiple matches)


Image

and then after selecting by query

No of individuals found by query (I made a named list, xsx) = 4198
Minus those in Non-Duplicates (157) = 4042
Shouldn't it be 4041?

It may be because of the instances of multiple matches?

Image

Post by Valkrider » 01 Jul 2012 19:27

Mike,

[grin] V1.4 works fine; it returns more matches than before and seems to be working much better for me.

I think that -5 for the Gender is insufficient, as I am still getting one showing up with a score of 17, so perhaps nearer to -10 might be a better bet.

Thanks very much for your continued work on this.

Post by BillH » 01 Jul 2012 19:57

Mike,

Version 1.4 is looking really good. Thank you so much for showing more people in the list. I was able to find some actual duplicates and clean them up.

A small issue pointed out by others is the time estimate. Mine shows an estimate of 2:38 for just under 9000 individuals, but it actually takes about 47 seconds. I'm not complaining; the 47 seconds is great.

The second line on my list is for two individuals that are not duplicates, but end up with 18 points. 6 for Names, 3 for Spouse, and 9 for Children.

As for the parents, both men are named William and both women have a middle name of Ann, but that is the only similarity. Neither the husbands nor their wives have any alternative names. Both husbands have 'Sr' as a suffix on their names. Seems like 9 points for them might be too high.

Here is what they look like with their children.

Image

Image

There are 9 children in both families, but the only names in common are James and William. 9 points for 9 children seems high when only 2 match up. Also, the children are born about 40 years later in one family than the other. Maybe there shouldn't be 1 point awarded for each child? Maybe only if the children's names and approximate birth dates match up?
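
To make the suggestion concrete, here is a rough Python sketch of scoring children only when both the name and the approximate birth year match. The function name, the 2-year tolerance, and the sample families are invented for illustration and are not taken from the Plugin.

```python
from typing import List, Tuple

Child = Tuple[str, int]  # (forename, birth year)

YEAR_TOLERANCE = 2  # assumed tolerance, purely illustrative


def matching_children_score(family_a: List[Child], family_b: List[Child]) -> int:
    """Award 1 point per child in family_a that has a same-named child in
    family_b born within YEAR_TOLERANCE years; each child is matched once."""
    unmatched = list(family_b)
    score = 0
    for name, year in family_a:
        for candidate in unmatched:
            same_name = candidate[0].lower() == name.lower()
            close_year = abs(candidate[1] - year) <= YEAR_TOLERANCE
            if same_name and close_year:
                unmatched.remove(candidate)
                score += 1
                break
    return score


# Invented data: the only shared names are James and William,
# and the births are about 40 years apart.
older = [("James", 1800), ("William", 1802), ("Mary", 1805)]
younger = [("James", 1840), ("William", 1843), ("Sarah", 1846)]

print(matching_children_score(older, younger))  # 0
print(matching_children_score(older, older))    # 3
```

With a rule like this, two families that merely have the same number of children would score nothing for Children unless the individual children line up.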

One last note... I'm still getting siblings showing up as possible duplicates.

Thanks again for all the work on this!

Bill

Post by tatewise » 01 Jul 2012 21:03

Chris ~ Is one of the Non-Duplicates in Named List NOT one of your Has Flag Records?
If so, that Record has already been excluded by your selection, so the Non-Duplicates exclusion won't exclude it again.
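
The arithmetic can be checked with a small Python sketch using sets. The record ids below are made up; the only assumption is that exactly one of the 157 Non-Duplicates lies outside the 4198 records found by the query.

```python
selected = set(range(1, 4199))       # 4198 record ids found by the query
non_duplicates = set(range(1, 157))  # 156 list members inside the selection
non_duplicates.add(99999)            # 1 list member outside the selection

assert len(selected) == 4198 and len(non_duplicates) == 157

# Removing the Non-Duplicates only removes the 156 that were selected.
remaining = selected - non_duplicates
print(len(remaining))  # 4042
```

So 4042 rather than 4041 is what you would expect if one list member was never in the selection to begin with.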

It looks like the time estimate is sometimes about double the actual, and sometimes about half the actual, so the Plugin needs to give a range of times to cope with different PC performance.

Lorna ~ Yes, I need to work on the Date chronology checks. The V1.4 attempt was a quick fix along with the necessary bug fixes.

I suspect it is inevitable that some Candidates will show up that are not Duplicates, whatever scoring strategy is used.
If more comparison checks are needed to get the Candidates accurate, then the run time will increase.
That is what the Non-Duplicates Named List is for, but I think we are still on a learning curve.

Post by BillH » 01 Jul 2012 21:35

Mike,

I agree that some non-duplicates will always appear. For my example, I was really just surprised that two families that are so different, and whose children were 40 years apart, could end up with 18 points and be at the top of my list, while some genuine duplicates were showing up well down the list. I thought maybe the counts just needed tweaking a little.

Is there any chance that you could add an option to allow us to exclude siblings? Two of the highest rated lines on my report are siblings and there are a few others down the list. In my mind siblings would never be duplicates.

Thanks,

Bill

Post by Cambiz » 01 Jul 2012 22:24

Mike,

Hush my pups! How did you suss that out?

Yes, a William John Wilton has sneaked his way into the Non-Duplicates named list.

I guess he matched against a William H Michell of St Cleer, Cornwall, who was born 23 years before him, same place.
or
William John Michell of Millom, Cumberland who has the same dates as Wilton.

Post by BillH » 01 Jul 2012 22:44

Mike,

I have a thought on the Non-Duplicates named list process. It seems like this could exclude people from the comparison that you don't really want to exclude. It is really a pair of names that you want to exclude, not an individual.

For example, if A is not a duplicate of B and C is not a duplicate of D, then under this process, these four individuals would be added to the named list. However, A could be a duplicate of C, A could be a duplicate of D, B could be a duplicate of C, and B could be a duplicate of D. Don't you really need a list of pairs of individuals that are not duplicates?
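
The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration of the argument, assuming the named list works by skipping comparisons where both individuals are on the list; it is not the Plugin's code.

```python
from itertools import combinations

people = ["A", "B", "C", "D"]
confirmed_non_duplicate_pairs = [("A", "B"), ("C", "D")]

# Individual-based list: every member of a confirmed pair goes on the list,
# and comparisons between two list members are skipped.
listed = {person for pair in confirmed_non_duplicate_pairs for person in pair}
by_individual = [
    pair for pair in combinations(people, 2)
    if not (pair[0] in listed and pair[1] in listed)
]

# Pair-based list: only the exact confirmed pairs are skipped.
excluded_pairs = {frozenset(pair) for pair in confirmed_non_duplicate_pairs}
by_pair = [
    pair for pair in combinations(people, 2)
    if frozenset(pair) not in excluded_pairs
]

print(by_individual)  # []
print(by_pair)        # [('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')]
```

With the individual-based list, none of the four remaining pairings would ever be compared again; with a list of pairs, all four are still checked.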

Bill

Post by tatewise » 02 Jul 2012 00:00

The Non-Duplicates Named List seemed like a workable solution that I could implement quickly to assess its capabilities, and I welcome your inspection of the logic.

A is non-duplicate of B
C is non-duplicate of D

If A or B were a duplicate of C or D, then those pairings should appear in the original list, and be analysed & merged by the user.
Thus there should be either two merged Records, or one merged Record with two unchecked Records, none of which should be placed in the Non-Duplicates Named List.

I am hoping that by tweaking the scoring and adding more checks, only a handful of false positive Non-Duplicates will need to be placed in the Named List.
Also from time to time, the user could temporarily rename the Named List to allow its member Records to be rechecked.

The problem I have is that the analysis and merging has to be performed using the Result Set in FH after the Plugin has finished.
So how does the user tell the Plugin which pairs of Records are Non-Duplicates?

Post by BillH » 02 Jul 2012 00:28

Mike,

I really appreciate that you came up with the named list idea. I think it is a great start, but doesn't exactly meet the requirement.

tatewise said:
A is non-duplicate of B
C is non-duplicate of D
If A or B were a duplicate of C or D, then those pairings should appear in the original list, and be analysed & merged by the user. Thus there should be either two merged Records, or one merged Record with two unchecked Records, none of which should be placed in the Non-Duplicates Named List.
One 'problem' is that I have a pair of individuals which are 'possible' duplicates, but I don't know for sure. So if A and C are 'possible' duplicates, I can't merge them, but I don't want them to disappear off the result set either. If I were to add A to the list because he/she is not a duplicate of B, then I also will end up losing the result set listing for A and C. I would also lose the ability to compare A to future additions to my database because A would be in the list. I know I could create a new list each time, but I really wouldn't want to keep track of the 40 or more individuals that are currently in the result set that I know are not duplicates of at least one other person and add them to the named list each time.

tatewise said:
I am hoping that by tweaking the scoring and adding more checks, only a handful of false positive Non-Duplicates will need to be placed in the Named List.
At the moment I have 61 pairs of individuals in the result set. Of these I have 0 sets of duplicates, 1 possible set of duplicates, and 60 definite non-duplicates. The 1 possible set of duplicates is on line 42 of the result set.

tatewise said:
The problem I have is that the analysis and merging has to be performed using the Result Set in FH after the Plugin has finished. So how does the user tell the Plugin which pairs of Records are Non-Duplicates?
Very good question. I must admit that I don't understand lua or the limitations that plugins work under.

When you select the two individuals from the result set and add them to the named list, they get added as two consecutive lines I believe. Could you code the plug in to have the individuals on lines 1 and 2 be a pair, lines 3 and 4 be a pair, etc.?

If that won't work, could you in some way build an external dataset that we could add the pair of individuals to and that dataset would be used by the plugin?

I don't want to appear too picky here. I think the plugin is doing a great job and really appreciate the hard work!

Thanks,

Bill

Post by tatewise » 02 Jul 2012 10:07

I appreciate your thoughts and ideas Bill.

BillH said:
So if A and C are 'possible' duplicates, I can't merge them, but I don't want them to disappear off the result set either.
One way of keeping track of 'possible' duplicates like A & C, is to use Save Result Set As... and maintain an external file list of 'possible' duplicates.

BillH said:
I would also lose the ability to compare A to future additions to my database because A would be in the list.
Not true. Any additional Individuals, and any changes to Individuals NOT in the list, are compared against members of the list. It is only list member versus list member comparisons that are excluded.

BillH said:
When you select the two individuals from the result set and add them to the named list, they get added as two consecutive lines I believe.
That is true, and the Plugin could use that, but there is a snag. If A & B are 'non-duplicates' and added to named list, their paired order can be retained. But if A & X are also 'non-duplicates' and added to named list, then only X is actually added, because A is already in list, and the paired ordering breaks down.

BillH said:
Could you in some way build an external dataset that we could add the pair of individuals to and that dataset would be used by the plugin?
Yes, I have that in mind. The Plugin would retain its own Result Set list from the last run. When the Plugin is used again an option would be to Manage Non-Duplicates. It would display its retained Result Set list, alongside its Non-Duplicates list, and allow row pairs to be moved from Result Set to Non-Duplicates, or deleted from Non-Duplicates. You would need to have the FH Result Set on display as an aide mémoire, before using the Plugin, because you cannot alter FH while Plugin is running. Alternatively, you could run the Plugin from another instance of FH on the same Project, but this risks the two instances of FH trying to update the same Project data.
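
The snag with reading consecutive list lines as pairs can be shown with a short Python sketch. The two helper functions are hypothetical stand-ins for how a named list without repeated members would behave; they are not FH or Plugin functions.

```python
def add_pair(named_list, first, second):
    """Append both records, skipping any that are already in the list."""
    for record in (first, second):
        if record not in named_list:
            named_list.append(record)


def read_pairs(named_list):
    """Interpret lines 1 and 2 as a pair, lines 3 and 4 as a pair, and so on."""
    return [tuple(named_list[i:i + 2]) for i in range(0, len(named_list), 2)]


named_list = []
add_pair(named_list, "A", "B")
add_pair(named_list, "A", "X")  # only X is appended; A is already there
add_pair(named_list, "C", "D")

print(named_list)              # ['A', 'B', 'X', 'C', 'D']
print(read_pairs(named_list))  # [('A', 'B'), ('X', 'C'), ('D',)]
```

After the second addition every later pair is read wrongly: X is paired with C, and D is left on its own.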

Post by Valkrider » 02 Jul 2012 10:18

Mike

Just a thought: couldn't you store RINs as pairs in the external non-duplicates table? Then, once the plugin has run, do a compare on pairs of RINs before displaying the dataset.

Or is this not possible, or is it what you are already suggesting, described differently?

One way may be to assign each pair a unique id and then do a lookup for that unique id before displaying the dataset, rather than looking at individual RINs.

Just some thoughts.
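
As a rough sketch of the idea in Python (not the Plugin's own language or design), each pair could be stored under an order-independent key in an external file and looked up before the results are displayed. The JSON file, the function names, and the record identifiers are all assumptions made up for this example.

```python
import json
import os
import tempfile


def pair_key(id_a: int, id_b: int) -> tuple:
    """Order-independent key, so (12, 7) and (7, 12) name the same pair."""
    return (min(id_a, id_b), max(id_a, id_b))


def save_pairs(path, pairs):
    """Write the set of pair keys to a JSON file as a sorted list."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(sorted(pairs), handle)


def load_pairs(path):
    """Read the pair keys back as a set of tuples."""
    with open(path, encoding="utf-8") as handle:
        return {tuple(pair) for pair in json.load(handle)}


def filter_results(candidates, non_duplicates):
    """Drop candidate pairs already recorded as non-duplicates."""
    return [pair for pair in candidates if pair_key(*pair) not in non_duplicates]


# Hypothetical record identifiers, invented for illustration.
store = os.path.join(tempfile.mkdtemp(), "non_duplicates.json")
save_pairs(store, {pair_key(12, 7), pair_key(30, 41)})

candidates = [(7, 12), (7, 30), (41, 30), (12, 41)]
print(filter_results(candidates, load_pairs(store)))  # [(7, 30), (12, 41)]
```

Because the key is the pair itself, record 7 can still be reported against record 30 even though it has been marked as a non-duplicate of record 12.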

Post by tatewise » 02 Jul 2012 14:08

Colin ~ You have described the process I plan to use, but just expressed differently.
The trickiest bit is designing the user interface to allow the pairs of Non-Duplicate Records to be conveniently managed.

On a point of information:
The unique id that I plan to use is what FH calls the Record Id.
This is NOT the same as the GEDCOM RIN tag for the Automated Record Id.
Thus it would be better to talk about Record Id instead of RIN in such matters.

BTW: Plugin V1.4 calculates the Generation Gap wrongly.
It uses the =RelationCode() function, whose description I misunderstood.
After some experiments, I now understand, so the next Version will be correct, and will reliably exclude very close relations (spouse, sibling, parent, uncle/aunt, grandparent, great-grandparent) from the Result Set.

Post by Valkrider » 02 Jul 2012 14:13

tatewise said:
Colin ~ You have described the process I plan to use, but just expressed differently.
The trickiest bit is designing the user interface to allow the pairs of Non-Duplicate Records to be conveniently managed.
Great minds think alike ;)
tatewise said:
On a point of information:
The unique id that I plan to use is what FH calls the Record Id.
This is NOT the same as the GEDCOM RIN tag for the Automated Record Id.
Thus it would be better to talk about Record Id instead of RIN in such matters.
I will do in future. [oops]

Post by gerrynuk » 02 Jul 2012 15:45

Just a word of encouragement to Mike - this is a great plugin and the speed at which you are developing it is truly outstanding.

Also to everyone involved in helping Mike - thanks for your contributions.

Gerry

Post by BillH » 02 Jul 2012 16:50

Gerry,

On your comment to Mike... I agree wholeheartedly. This is a great app that fills a real need. I have already found 6 duplicates that I didn't know I had and merged them.

Bill

Post by BillH » 02 Jul 2012 16:55

Mike,

Thanks for the explanation of how the list currently works. I misunderstood.

The idea you are looking at for using the result set from the last run along with the Manage Non-Duplicates sounds very promising. I hope it isn't too difficult for you to program. Sounds like a great idea.

Thanks,

Bill

Post by tatewise » 03 Jul 2012 09:35

Thank you all for your continued patient assistance and suggestions.
I have no genuine Duplicates in my data, so I rely on your data to assess the effectiveness and run time performance of this Plugin.

It would be invaluable if you could retain an old version of your data with genuine Duplicates, possible Duplicates, false positive Duplicates, sibling twins, etc.
It only needs to be a standalone GEDCOM, possibly retrieved from a Backup dated before this exercise started.
This will allow the Plugin enhancements to be assessed against a standard range of data.

I thank you in advance if you can help with this.

I think it is time to start a new thread, which I will do when Plugin V1.5 is available.

Post by johnmorrisoniom » 03 Jul 2012 10:17

Hi Mike,
A lot of my false positives are where the Forenames and birth year are the same, but the birth place is not (one is Andreas, Isle of Man, the other is Malew, Isle of Man) and the surnames are not even close (such as John James Teare and James John Clucas).

Post by Dagwood » 03 Jul 2012 10:48

tatewise said:
Thank you all for your continued patient assistance and suggestions.
I have no genuine Duplicates in my data, so I rely on your data to assess the effectiveness and run time performance of this Plugin.
It is we who have to thank you Mike!
I have been finding a mixture of duplicates I did not know were there and false ones which have led me to check facts I probably would not have reached for ages. Version 1.0 gave me a list of about 12 ancestors, of whom about 4 pairs were duplicates. Version 1.4 gave me a new list of about 24 pairs, some of whom were fresh duplicates; a few were almost identical, e.g. cousins born the same year, with the same or similar name, in the same village; and some just did not look at all like duplicates and could be ignored immediately.
Runtime? I get about 900 records checked in approx 5 seconds.
A really useful plug-in.
Many thanks
Dagwood [smile]

Post by Valkrider » 03 Jul 2012 11:32

Mike

You are more than welcome to a copy of my GEDCOM, which has some genuine duplicates as well as ones that aren't, including the wrong gender ones. It has around 1700 individuals in it.

How do I get it to you?

Post by tatewise » 03 Jul 2012 14:05

Thank you for the offer Colin, but for the moment I would prefer users like you to assess the Plugin against your own data, and assess its user interface from an independent perspective.
