* Find Duplicate Individuals Version 1.5+

tatewise
Megastar
Posts: 27082
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Post by tatewise » 03 Jul 2012 23:22

The Find Duplicate Individuals Version 1.5 is now available for download.

This adds several new features as described in the Help page.

It corrects a mistake in calculating Generation Gap and now also removes immediate relatives such as Siblings, and Parent/Child from the results.

It corrects a mistake in checking Event Date Chronology and now also removes glaring mismatches from the results, such as where one Individual was Married & Died before the other was Born & Baptised.

It now not only checks Individual Event Dates, but also those of the Father, Mother, Spouse, and 1st Child.

A Diagnostic Mode has been added to aid analysis of the points scoring system.

The Non-Duplicate Management feature previously discussed will have to wait for another day as I have other pressing commitments at present.

ID:6362
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Valkrider
Megastar
Posts: 1534
Joined: 04 Jun 2012 19:03
Family Historian: V7
Location: Lincolnshire

Post by Valkrider » 04 Jul 2012 07:37

Mike

Thanks for this.

I have just tried it out. It has found a few more than 1.4, including some new ones (I suspect they had lower scores before, as they aren't new records). The gender match that I had before is now gone.

I will spend some more time this afternoon looking at more records and let you know.

Dagwood
Superstar
Posts: 302
Joined: 30 Nov 2009 17:37
Family Historian: V6.2

Post by Dagwood » 04 Jul 2012 09:42

Hi Mike, I was about to ask if non-duplicates could somehow be excluded but saw your thread re v1.5 and have read your notes re a named List ie:
'In addition, any Individual Records placed in a Named List called 'Non-Duplicates' will automatically be excluded.'

I have tried this on a pair of names but they keep coming back each time I run it. I have tried variations of the list name, i.e. with and without single and double quote marks, but cannot get them to be excluded.

I can't see what I'm doing wrong. Any thoughts?
Dagwood

Jane
Site Admin
Posts: 8441
Joined: 01 Nov 2002 15:00
Family Historian: V7
Location: Somerset, England

Post by Jane » 04 Jul 2012 10:31

It's working for me. Make sure the name of the list is

Non-Duplicates

It must be exactly as above; try cutting and pasting the name in.
Jane
My Family History : My Photography "Knowledge is knowing that a tomato is a fruit. Wisdom is not putting it in a fruit salad."

LornaCraig
Megastar
Posts: 2996
Joined: 11 Jan 2005 17:36
Family Historian: V7
Location: Oxfordshire, UK

Post by LornaCraig » 04 Jul 2012 11:07

Mike,

Thanks for your continued work on this.   I think it is unlikely that I have any genuine duplicates in my data but it is interesting to see how the plugin works and what suggestions it makes.

As you said in one of your other posts, it is likely that whatever scoring system is used there will be some false positives which get a higher score than some genuine duplicates. A low score for genuine duplicates is sometimes unavoidable because the records for a duplicate pair may be fully compatible (in that there is no conflicting data) but if the data for one or both of them is scarce there may be few actual positive matches in the data to move the score in a positive direction. For example I have a pair of records which I entered as separate individuals but I linked using ‘Associated Person’ and added notes to both to explain that I thought they might be one and the same person. The evidence is fairly compelling but is purely circumstantial, recorded in notes rather than hard facts. Therefore the only actual match in their data is the name, and they get a low score. No refinement in the scoring system could change this (although I think perhaps they should not lose any points for a difference in child count: see below).

However there are some false positives getting a higher score than this pair, which shouldn’t. In these cases there are a number of positive matches in the data but a single chronological incompatibility which ought to outweigh all the positives because it makes a match impossible. Here are two examples:

1.  A and B have similar names.  There are no event dates for A, but she was married and there is a birth date for her husband.  Therefore her marriage date must be later than her husband’s birth date.  But she is being matched with B, who died 60 years before that date.

2.  C and D have similar names and the same number of children.  There are no dates for C himself but there are baptism dates for his children, in the 1640s.  Therefore his own birth date must be before the 1640s.  But he is being matched with D who was born more than 300 years later.

Anomalies like these could be removed if the chronology checks were extended where an individual’s own dates are not known. Dates from their immediate family could be used to place constraints on the individual’s own dates.    I do realise, of course, that adding more checks will slow down the run time of the plugin.  A compromise has to be reached, but at present it is checking my 3533 individuals in just 9 seconds (46 seconds in diagnostic mode) so I have no problem with run time.

One other observation: I’m not sure that it’s a good idea to deduct points if the child count differs.  In a typical case of duplicates, one of the records will have been in the gedcom file for a while and will have a fairly full set of family members recorded, e.g. from census returns.   When a contemporary individual with the same name is found, the name might have turned up in a different type of document, such as an employment record, where no information about family would be recorded.  The lack of family information means that no additional points are gained, but equally it should not mean that points are lost.

Thanks again for all the time you are giving to this plugin – it must be a labour of love!
Lorna

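Lorna's two examples can be sketched as a simple feasibility check. This is a minimal Python illustration, not the plugin's own code, which is not shown in this thread. The `Person` fields, the function name and the specific years are all hypothetical, chosen only to mirror examples 1 and 2. The sketch also treats every year as exact, whereas a real check would need some tolerance for approximate dates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    """Hypothetical, simplified record: years only, None when unknown."""
    birth: Optional[int] = None
    death: Optional[int] = None
    spouse_birth: Optional[int] = None
    first_child_baptism: Optional[int] = None


def _known(values: List[Optional[int]]) -> List[int]:
    """Drop the unknown (None) years."""
    return [v for v in values if v is not None]


def could_be_same_person(a: Person, b: Person) -> bool:
    """Return False only when the combined dates are impossible for one person."""
    # The birth year must be >= every recorded birth, and <= every limit
    # implied by a recorded birth or by a first child's baptism.
    born_after = _known([a.birth, b.birth])
    born_by = _known([a.birth, a.first_child_baptism,
                      b.birth, b.first_child_baptism])
    if born_after and born_by and max(born_after) > min(born_by):
        return False

    # The death year must be >= every year the person is known to have been
    # alive (a marriage cannot precede the spouse's birth), and <= every
    # recorded death.
    alive_at = _known([a.birth, a.death, a.spouse_birth,
                       b.birth, b.death, b.spouse_birth])
    dead_by = _known([a.death, b.death])
    if alive_at and dead_by and max(alive_at) > min(dead_by):
        return False
    return True


# Example 1: A has no dates of her own, but her husband was born in 1800;
# B died in 1740, 60 years earlier. (Illustrative years.)
example_1 = could_be_same_person(Person(spouse_birth=1800),
                                 Person(birth=1700, death=1740))

# Example 2: C's first child was baptised in 1645; D was born in 1950.
example_2 = could_be_same_person(Person(first_child_baptism=1645),
                                 Person(birth=1950))
```

Both examples come out as impossible matches under these assumptions, while a pair with no dates at all is left for the points system to judge.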

Post by Dagwood » 04 Jul 2012 11:25

Jane said:
It's working for me, make sure the name of the list is

Non-Duplicates

It must be exactly as above try cutting and pasting the name in.
Tried again and copied and pasted this time as you suggested Jane. Still both names appear as duplicates in the list.
So far I have tried it as you suggested, with quotes and with single quote marks, with just one name, and with both names. Every time the pair appears back on the list. I've even tried re-downloading this version and repeating it all over again.
I can't think what might be different to what you and others are doing.
Dagwood

Cambiz
Famous
Posts: 235
Joined: 26 Sep 2003 23:30
Family Historian: None

Post by Cambiz » 04 Jul 2012 11:35

Image

I selected 8 individuals and ran


Post by Cambiz » 04 Jul 2012 11:40

The Non-Duplicates list existed but was empty on the first run.

I added all bar the 8 records to Non-Duplicates and it appears to be running ok that way.


Post by tatewise » 04 Jul 2012 11:40

Lorna ~ Thank you, those are some useful tips.

Dagwood ~ Are you sure it is the Non-Duplicates pair that is listed in the Result Set?
Carefully check the Record Id.
The two Individuals may still appear individually in the Result Set paired with other candidates.

Chris ~ Thanks for the error report - I'll check into it.

Post by tatewise » 04 Jul 2012 12:35

The problem is now fixed Chris and a new V1.5 dated 4 Jul 2012 is in the WiP download. ~ Sorry.

RogerF
Famous
Posts: 182
Joined: 26 Apr 2009 16:32
Family Historian: V6.2
Location: Oxfordshire, England

Post by RogerF » 04 Jul 2012 13:43

I'm a bit confused by the version numbering; today's version is 1.6.

I think the scoring is perhaps still a bit adrift? I have a Mary Ann FIRTH and an Ann FIRTH coming 14th in my result set. They have the same year of birth -- 1817(app) -- but that's it. Different place of birth, completely different christening data (+4), and different parents (+3, +3). I wouldn't expect them to be so near the top of the list... unless I really don't have many (or indeed any) duplicates in my 12000 records. I /did/ have three genuine dups, which 1.4 found -- many thanks for that. When fully developed, this will prove a fabulous aid!
Roger Firth, using FH to research the FIRTHs of Lancashire and Yorkshire, and the residents of the market town where I live.

TimTreeby
Famous
Posts: 168
Joined: 12 Sep 2003 14:56
Family Historian: V6.2
Location: Ogwell, Devon

Post by TimTreeby » 04 Jul 2012 15:03

Hi Mike,
2 things

1) I think the points system needs tweaking slightly, especially with regard to places, as the Raymonts score higher than the Dumbles even though there are more place name discrepancies.

Image

and how scored

Image

2) Regarding the runtime of the query, the spec of the machine makes a lot of difference.

i.e. Main PC - Windows 7 64 Bit - 4GB RAM - Intel Core 2 Duo @ 3.06GHz takes 2 mins 5 secs. Keeps CPU @ approx. 50%, so it is easy to do other things at the same time.

Laptop - Windows XP 32 Bit - 1GB RAM - Intel Pentium M @ 1.4GHz takes 11 mins 35 secs. Keeps CPU @ approx. 99%, so it is hard to do anything else at the same time.

Same GEDCOM of 10606 people.

I don't think you can do much about the speed, but perhaps you could add a warning that lower-spec machines will take a lot longer to run.


Post by Valkrider » 04 Jul 2012 17:15

Mike,

Just FYI v 1.6 did not replace v1.5 for me. As you can see from the screenshot it added a second instance with -1 at the end of the plugin name.

It seems to be taking a little longer to run than v1.4 but not significantly.

It is throwing up more duplicates than before and virtually all of them are not genuine. It does not seem to be respecting place of birth (if it should); I am getting Aberdeen matched with Canterbury purely because they are the same year.

The good news is that each refinement seems to find at least 1 genuine duplicate that the previous version missed.

Thank you once again for developing this.

Image

BillH
Megastar
Posts: 2184
Joined: 31 May 2010 03:40
Family Historian: V7
Location: Washington State, USA

Post by BillH » 04 Jul 2012 19:19

Mike,

The exclusion of siblings and parent/child individuals is working great in version 1.6.

Like Colin, I am seeing more non-duplicates than before. Usually this is because one forename is the same, but not usually the same one. For example, I am seeing a lot of pairs something like this: James Kenneth Henshaw and Alex James Henshaw. I went from 61 to 98 pairs and I'd guess that at least half of them are like this. Can the order of the names be considered and something deducted if they are not the same?

I have one pair, Margaret Ann Strader and Julia Ann Margaret Hunsicker. The forenames both have Ann and Margaret in them, but the first name and surname are different. This pair shows up 4th on my list. This pair both have husbands that have the last name Henshaw.

Both women have the same number of children. I wonder if too many points are being given for having the same number of children, in this case 4.

Also, for the first couple, the first child is named Nancy Lee Henshaw (a girl). For the second couple the first child is named Marion Lee Henshaw (a boy). The pair is getting 6 points under Child 1 because the two children both have Lee and Henshaw in their name even though their first names are different and they are of different gender.

So, all together this pair is getting 19 points.

All in all a great plugin. Thanks for all the hard work!

Bill


Post by tatewise » 04 Jul 2012 21:40

Sorry about the mix up regarding V1.5 and V1.6, but all should be OK now at V1.6 4 July 2012. Slightly too much haste in posting the fix for Chris, which would affect everyone using the subset selection option.

This version has some more date chronology checks using relations' dates where the individual has none, as suggested by Lorna, because they were easy to add.

A number of you are suggesting that Child Count is doing more harm than good, so I am tempted to remove it, now that there are so many other more significant checks.

Roger said
I have a Mary Ann FIRTH and an Ann FIRTH coming 14th in my result set. They have the same year of birth -- 1817(app) -- but that's it. Different place of birth, completely different christening data (+4), and different parents (+3, +3). I wouldn't expect them to be so near the top of the list... unless I really don't have many (or indeed any) duplicates in my 12000 records.
By my reckoning that pairing scores about 22, which with around 250 points now on offer, is very low. You say 'completely different christening data (+4), and different parents (+3, +3)' but there must be something small in common for some points to be scored. You have hit the nail on the head with 'unless I really don't have many (or indeed any) duplicates'. Then the Plugin will insist on listing the highest scoring (but very low points) false positives.

Tim ~ A similar argument applies to your low score listings. Also the difference is only 5 points out of a maximum of 250+. Perhaps the Plugin should give percentage scores as well as points scores?

I had assumed larger databases would run on more powerful PCs, but in the published version the 'Help & Advice' could say something about performance.

Colin ~ I can't explain the download problem. Those issues are usually associated with browser behaviour.

I suspect the longer listings are more of the same. The Plugin does take account of Place, but only adds points if there is some agreement (and only if Dates are similar), rather than deducting points if Places differ.

Bill ~ More of the same regarding low scores. The tip about Child Gender mismatch is good. The way the Plugin works makes checking the order of names tricky. I could award more points for a matching Surname than a matching Forename. As I said above, I may drop Child Count.

The name checking is deliberately fuzzy, because if genuine duplicates come from different sources, then the names may be slightly different, in a different order, or parts missing, or even Forenames & Surnames swapped. The latter can happen on Marriage or on Adoption.

What may have gone unnoticed is that as more checks have been added the maximum score has gone up and up. So what was a 'good' score in V1.0 may now be a 'poor' score in V1.6, so percentage scoring may help make this clearer.

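To illustrate the percentage idea, here is a minimal Python sketch, assuming the percentage is simply the points scored divided by the maximum on offer. The function name is hypothetical, 250 is only the approximate maximum quoted above, and the maximum of 100 is an invented figure used purely for contrast.

```python
def percent_score(points: int, max_points: int) -> float:
    """Express a raw points score as a percentage of the maximum on offer."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    return round(100.0 * points / max_points, 1)


# The pairing reckoned at about 22 points, against roughly 250 on offer.
current = percent_score(22, 250)

# The same 22 points against a hypothetical earlier maximum of 100.
earlier = percent_score(22, 100)
```

The same raw score is a much smaller share of the total once more checks have raised the maximum, which is the effect a percentage column would make visible.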

Post by LornaCraig » 04 Jul 2012 22:00

Mike,

I've just seen your latest post and note that you have now included some more chronology checks using relations' dates.  Thank you for such a quick response!  I don't have time to look at the effects of the changes now but will try it tomorrow.

Meanwhile, following on from one of the points in my earlier post, about not deducting points if there is a difference in child count,  I think there may be a similar problem with relatives’ names.
 
As I understand it, if there are names recorded for the father, mother, spouse or first child of each of the people being compared, but the names of those relatives don’t match, 10 points are deducted.   This is sensible for names of parents but may not be appropriate for spouses or first children.   One of the reasons why duplicates sometimes go undetected is that the same individual may have had more than one marriage and more than one set of children, and may have been entered into the gedcom as two different people, one with each family.    If the spouse and children’s names don’t match, the plugin will deduct 20 points (10 for spouse non-match, 10 for child non-match), even though the individuals could be genuine duplicates.  

I think it may be better to stick to adding points for matching relatives’ names and deducting points for non-matching parents' names, but not deduct anything for non-matching spouse names or first child names.

I note what you say about the maximum number of points having increased a lot, and agree that a percentage score would be helpful.
Lorna


Post by TimTreeby » 04 Jul 2012 22:34

Definitely think Percentage would be better than a raw score. But I think you do have a problem with your newer version. The reason is that the matches I got before, which were duplicates, either do not now show or come way down the list.
List from V1.2
Image

List from V1.5
Image

As you can see, the Elizabeth Hancocks are now way down the list. It also doesn't seem to match Elizabeth Hancock's parents even though they were duplicates as well, and the Lyle Boundy Smiths don't even show. These are definite duplicates.

Image


Post by BillH » 04 Jul 2012 22:35

Mike,

Even though there may be a 'maximum' of 250 points or so, my actual duplicates are scoring 36 or less, so they get lost in amongst the non-duplicates which are scoring 20 or less. 36 is the most points of any pair on my list.

I think dropping the child count would be good.

Is there a way to separate first name from other forenames? If so, could we see an option to only include first name and not other forenames? This might help me eliminate over half of my 98 non-duplicates. I know everyone wouldn't want to do this which is why I think a user selectable option would be good.

Bill

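The option Bill describes could look something like the following. This is a minimal Python sketch, assuming the last word of a name is the surname and everything before it is a forename. The function names and the `first_only` flag are hypothetical and are not the plugin's actual options.

```python
from typing import List, Tuple


def split_name(full_name: str) -> Tuple[List[str], str]:
    """Split into (forenames, surname), assuming the surname is the last word."""
    words = full_name.lower().split()
    if not words:
        return [], ""
    return words[:-1], words[-1]


def forenames_match(name_a: str, name_b: str, first_only: bool = False) -> bool:
    """Fuzzy mode: any shared forename. Strict mode: first forenames must be equal."""
    fore_a, _ = split_name(name_a)
    fore_b, _ = split_name(name_b)
    if not fore_a or not fore_b:
        return False
    if first_only:
        return fore_a[0] == fore_b[0]
    return bool(set(fore_a) & set(fore_b))


# Fuzzy matching pairs these two because both contain "James".
fuzzy = forenames_match("James Kenneth Henshaw", "Alex James Henshaw")

# With the user option switched on, "James" and "Alex" no longer match.
strict = forenames_match("James Kenneth Henshaw", "Alex James Henshaw",
                         first_only=True)
```

The trade-off is the one Mike raises above: a strict first-name comparison would also reject genuine duplicates whose forenames were recorded in a different order, which is why it suits a user-selectable option rather than a default.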

Post by tatewise » 04 Jul 2012 22:59

I think the next version should drop Child Count.

Also I think it should include Percentage scores (as well as points).

Lorna's comments about mismatching relatives' names also crossed my mind. A similar argument could even be made for parents, where a person is adopted or fostered. However, -10 or -20 points is now less than 10% of high scoring duplicates. But maybe reducing to -5 points per relative may be a compromise.

Post by BillH » 04 Jul 2012 23:09

Mike,

I think Lorna's idea of mismatching parents is a great idea. Many of my non-duplicates could be dropped from the list if the parents' names were compared. Maybe a lot of points can be deducted if the parents don't match up.

Bill


Post by LornaCraig » 04 Jul 2012 23:26

Bill,

My understanding is that the parents' names are already being compared, and points deducted for a non-match as well as being added for a positive match. This is what the info says:

'The name matching is not only performed for the pair of Individuals, but also for their Father, Mother, Spouse, and Child relatives. Although, at present, only the first instance of each of these relatives is assessed. Thus the maximum score for five good Name matches is 50 points.
If both relatives exist, but their names have no matches, then 10 points are deducted.'

My point was that while deducting points is appropriate for non-matching parents' names it is less appropriate for non-matching spouse or children's names. An individual who has two spouse-families may have been entered as two separate people. The fact that the spouse and children's names don't match is compatible with them being duplicates.

As Mike has now pointed out, it may not even be appropriate to deduct points for non-matching parents' names, because of cases of adoption and fostering.

The further we look into this the more complicated it gets....
Lorna

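The rule quoted from the plugin's info, and the change Lorna proposes, can be compared with a small sketch. This is a minimal Python illustration under stated assumptions: up to 10 points per name (so 50 for five good matches) and a 10 point deduction for a complete non-match come from the quoted text, but the partial-credit formula, the function names and the example family are all hypothetical.

```python
from typing import Dict, FrozenSet, Optional

RELATIONS = ("self", "father", "mother", "spouse", "child")
ALL_RELATIVES = frozenset({"father", "mother", "spouse", "child"})
PARENTS_ONLY = frozenset({"father", "mother"})


def name_points(name_a: Optional[str], name_b: Optional[str], penalise: bool) -> int:
    """Score one pair of names: 1 to 10 for shared words, -10 for none if penalised."""
    if not name_a or not name_b:
        return 0  # a missing name neither gains nor loses points
    words_a = set(name_a.lower().split())
    words_b = set(name_b.lower().split())
    shared = words_a & words_b
    if not shared:
        return -10 if penalise else 0
    # Assumed partial credit: proportion of words in common, out of 10.
    return 10 * len(shared) // max(len(words_a), len(words_b))


def score_names(a: Dict[str, str], b: Dict[str, str],
                penalised: FrozenSet[str] = ALL_RELATIVES) -> int:
    """Sum name points over the individual and the four relatives."""
    return sum(
        name_points(a.get(rel), b.get(rel), rel in penalised)
        for rel in RELATIONS
    )


# Hypothetical case: one woman entered twice, once with each spouse-family.
first_family = {"self": "Mary Ann Taylor", "father": "James Taylor",
                "spouse": "John Smith", "child": "Ann Smith"}
second_family = {"self": "Mary Ann Taylor", "father": "James Taylor",
                 "spouse": "Peter Brown", "child": "Thomas Brown"}

as_described = score_names(first_family, second_family)            # all relatives penalised
as_proposed = score_names(first_family, second_family, PARENTS_ONLY)
```

In this invented case the two deductions cancel the 20 points earned by the matching individual and father, whereas penalising only parents leaves those 20 points intact.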

Post by BillH » 05 Jul 2012 00:29

Hi Lorna,

Sorry for misstating what you were saying.  I was confused.  [oops]

I agree that deducting points is less appropriate for non-matching spouse or children's names.

I guess my problem is that I have almost 100 non matches on my report and most of them are being reported because the two individuals have similar, but not matching names.  Maybe there is a common middle name or the two have their first and middle names reversed.  For all of these, the parents don't match up.  

While I agree that there might be an adoption involved, that is rare compared to the number of situations where there is no adoption and the two are really not duplicates.

I was hoping to have a way to eliminate these from the report as they tend to hide my actual duplicates.  Maybe there could be a user selectable option to exclude the pair from the report if the parents don't match up?  

Maybe too many points are being given for the parts of the name that match and too few deducted for non-matches. The points deducted for a non-match have to overcome any points added for the part that matches.

I would think that if the mother or father has a different surname, then I would say it isn't a match and the pair should not be candidates for being duplicates. I guess I would also think that if only the surnames match, but not the forenames, they would not be matches either.

It is true that it gets more and more complicated.  I can't believe that Mike has put together such a great plugin in such a short time.  (Actually, I guess I can based on his prior work, especially the Map Life Facts plugin. [smile]).

Bill


Post by Dagwood » 05 Jul 2012 15:36

tatewise said:
Dagwood ~ Are you sure it is the Non-Duplicates pair that are listed in the Result Set.
Carefully check the Record Id.
The two Individuals may still appear individually in the Result Set paired with other candidates.
An odd one this Mike. I checked what you suggested and the pairs were still there. I tried about 3-4 times more with no alterations made after checking Named List was correct. This morning I tried again and the names were removed from the duplicates list. I don't think anything was altered but at least it appears to be working ok now.
Just one thought. If a pair of names appear on the Named List and later a similar name to one or both is added to the records will it get missed as a possible duplicate because of its other half still being on the list?
Dagwood


Post by tatewise » 05 Jul 2012 16:12

Were you getting the No Duplicate Individuals Found message, while the Named List names were still displayed?

No, new extra similar names will NOT get missed.

Post by Dagwood » 05 Jul 2012 16:54

tatewise said:
Were you getting the No Duplicate Individuals Found message, while the Named List names were still displayed?

No, new extra similar names will NOT get missed.
Mike,
I don't recall seeing that at any time as there were a number of duplicates on each list I have displayed. If it happens again I'll check.

Just for my clarification, does that mean that if I had say three Tom Bodles and two had been put on the Named List the addition of the third, new, record would result in potentially two sets of duplicates being displayed even though the first two are already on the list and therefore will not be displayed as duplicates?

Dagwood

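The three Tom Bodles case can be sketched as follows. This is a minimal Python illustration of one possible reading, assuming a pair is suppressed only when both records are in the Non-Duplicates list, which fits Mike's remark that listed individuals may still appear paired with other candidates. The plugin's actual rule is not stated in this thread, and the function name and Record Ids are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Iterator, Set, Tuple


def candidate_pairs(record_ids: Iterable[int],
                    non_duplicates: Set[int]) -> Iterator[Tuple[int, int]]:
    """Yield every pair of records except pairs wholly inside the list."""
    for a, b in combinations(sorted(record_ids), 2):
        if a in non_duplicates and b in non_duplicates:
            continue  # assumed rule: both records are listed, so skip the pair
        yield a, b


# Hypothetical Record Ids for three Tom Bodles; the first two are on the list
# and the third is the newly added record.
tom_bodles = [101, 102, 203]
pairs = list(candidate_pairs(tom_bodles, {101, 102}))
```

Under this assumption the new record is still compared with each of the two listed ones, so two candidate pairs remain and only the listed pair is hidden. A single flat list cannot record which listed records were judged against which, so every pair within the list is skipped.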