Find Duplicate Individuals Version 1.5+
Posted: 05 Jul 2012 21:57
by tatewise
Yes.
Assume originally Tom Bodles (A) matched Tom Bodles (B) in the Result Set and the pair were placed in the Non-Duplicates list.
Then you add Tom Bodles (C), and assuming enough details match then the Result Set should include:
Tom Bodles (A) matches Tom Bodles (C)
Tom Bodles (B) matches Tom Bodles (C)
Posted: 05 Jul 2012 22:29
by LornaCraig
Mike,
The chronology checks in V1.6 seem to be working fairly well but could probably still be extended. The plugin is still trying to match (admittedly with a low score) an individual whose children were baptised in the 1640s with an individual who was born in 1945. The date of baptism of the first individual's children could be used to assign him a birth date before 1640, but no points have been deducted for a non-matching date.
It might be a good idea to assume a lifespan of no more than 100 years (with apologies to anyone who has very long-lived ancestors). This would mean that if a birth date is known but there is no death date, the plugin would not try to match the individual with someone who has some life events more than 100 years later. Conversely, if a death date is known but no birth date, it would not try to match the individual with someone alive more than 100 years earlier.
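The 100-year rule suggested above could be sketched roughly as follows. This is only an illustration in Python, not the Plugin's own code; the function names and the year-only representation are assumptions made for the example.

```python
from typing import Optional, Tuple

MAX_LIFESPAN_YEARS = 100  # assumed cap, taken from the suggestion above

def life_window(birth_year: Optional[int],
                death_year: Optional[int]) -> Optional[Tuple[int, int]]:
    """Return the (earliest, latest) years the person could have been alive.

    A missing birth or death year is filled in from the other one using the
    assumed maximum lifespan. Returns None when neither year is known.
    """
    if birth_year is None and death_year is None:
        return None
    earliest = birth_year if birth_year is not None else death_year - MAX_LIFESPAN_YEARS
    latest = death_year if death_year is not None else birth_year + MAX_LIFESPAN_YEARS
    return earliest, latest

def could_be_same_person(a: Optional[Tuple[int, int]],
                         b: Optional[Tuple[int, int]]) -> bool:
    """Keep a pair as candidates only if their life windows overlap."""
    if a is None or b is None:
        return True  # no date evidence either way, so do not rule the pair out
    return a[0] <= b[1] and b[0] <= a[1]
```

With a hypothetical birth year of 1620 for the first individual, his window ends in 1720 and no longer overlaps that of someone born in 1945, so the pair would be dropped.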
On the question of deducting points for non-matching relatives' names, as you said in an earlier post, -10 or -20 points is now less than 10% of high-scoring duplicates. However, I suspect that in practice very few duplicates will get scores higher than about 50 and some may get as little as 10. If they scored as much as 200 they would have so much in common that they would probably already have been noticed and the records merged. Some genuine duplicates will inevitably get low scores. There may not be many positive matches in the data, perhaps because the data for one of them is scarce or, crucially, because an individual with two spouse families has been entered as two different people. It is the job of the plugin to show that the records are compatible even if they have a low score. Deducting points for non-matching spouse or child names could wipe out the low score. I am still inclined to think that it is enough to add points for matching spouse and child names, without deducting points for non-matches.
Posted: 06 Jul 2012 16:25
by TimTreeby
Mike,
I have found a slight problem with your date-checking algorithm. I believe this is due to you treating dates as ranges but not considering (bef), (aft) and (app) dates.
I.e. in your explanatory notes you say:
Any single Date is treated similarly, so 1 Feb 1777 starts & ends on 1 Feb 1777, whereas 1666 starts 1 Jan 1666 and ends 31 Dec 1666.
I think this then leads to a date of (bef) 1799 being treated as 1 Jan 1799 to 31 Dec 1799
and a date of 1796 (app) being treated as 1 Jan 1796 to 31 Dec 1796. This would lead to NO MATCH, and therefore to points being deducted, possible matches being missed, or the score for probable matches being reduced. Unless I have misunderstood how the date matching works.
If it is possible, I would suggest the following, which would overcome this:
(bef) : set range from (year-10 years) to year
(aft) : set range from year to (year+10 years)
(app) : set range to (year-5 years) to (year+5 years)
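To illustrate the three rules above, here is a small Python sketch working on years only. It is an assumption-laden illustration, not the Plugin's code: the function names and the qualifier strings are made up for the example.

```python
from typing import Tuple

def year_range(year: int, qualifier: str = "") -> Tuple[int, int]:
    """Map a year and an optional qualifier to an inclusive (start, end) range."""
    if qualifier == "bef":
        return (year - 10, year)
    if qualifier == "aft":
        return (year, year + 10)
    if qualifier == "app":
        return (year - 5, year + 5)
    return (year, year)  # plain year: no widening

def ranges_overlap(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True when the two inclusive year ranges share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]
```

With these rules, (bef) 1799 becomes 1789 to 1799 and 1796 (app) becomes 1791 to 1801, so the two ranges overlap instead of being treated as a mismatch.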
I have a Gedcom file where I can prove this to be the case if you want a test file; I have extracted just the Boundys as shown in the diagram in my previous post. If I change the dates around then the Elizabeth Hancocks get a much higher score, and then the John Hancocks show as matches, as well as Jeneffe Boundy to Jenefee (Hancock).
Posted: 06 Jul 2012 19:56
by tatewise
Tim ~ Progress on the Plugin is a bit slow at present, because of my other commitments.
However, your conclusions are absolutely correct, and I had already created a few individuals based on your tree diagrams above, and implemented the Before/After (and From/To) exactly as you propose.
I was not sure whether to use +/-10 years, or +/-100 years, or even as near infinity as FH would allow, but you have plumped for the same as me.
This does result in increased scores for the Hancocks ~ Elizabeth HANCOCK gets 46 points, and John gets 30.
I am not so sure about Approximate (or Calculated) dates. What do others think?
How should I evaluate for example 4 July 1888 (Approx) or May 1777 (Approx) as opposed to 1666 (Approx)?
On another topic suggested by Bill, I have revisited Forename checking and implemented a scheme to take into account their position.
I won't go into the algorithm details, but the Name scoring is as follows and easily tweak-able:
7 points for SURNAME match.
6 points for Forename match in correct position i.e. both 1st, or both 2nd, etc.
3 points for Forename match but different position.
2 points for Soundex match, excluding all the above.
Knowing this, if users suspect an individual with multiple Forenames is a duplicate, then creating Alternate Names with the Forenames in different positions may find an elusive match, i.e. John James SMITH as well as James John SMITH.
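The position-aware scoring listed above might look roughly like the Python sketch below. It is not the Plugin's algorithm (which is not given here); the function name is hypothetical, Soundex scoring is left out to keep it short, and no overall limit on Name points is applied.

```python
from typing import List

SURNAME_MATCH = 7            # points for a SURNAME match
FORENAME_RIGHT_POSITION = 6  # Forename matches in the same position
FORENAME_WRONG_POSITION = 3  # Forename matches but in a different position

def name_score(forenames_a: List[str], surname_a: str,
               forenames_b: List[str], surname_b: str) -> int:
    """Score two names using the three point values above (no Soundex)."""
    score = 0
    if surname_a.lower() == surname_b.lower():
        score += SURNAME_MATCH
    a = [f.lower() for f in forenames_a]
    b = [f.lower() for f in forenames_b]
    for position, forename in enumerate(a):
        if position < len(b) and b[position] == forename:
            score += FORENAME_RIGHT_POSITION
        elif forename in b:
            score += FORENAME_WRONG_POSITION
    return score
```

In this sketch, John James SMITH against James John SMITH gets 7 + 3 + 3 points, whereas the same pair with the Forenames in the same order gets 7 + 6 + 6, which is why an Alternate Name with the Forenames swapped can lift an elusive match.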
Tim ~ This new scheme boosts the score for the Lyle Boundy SMITH pair to 19 points together with the other tweak below.
The biggest problem with this pair is that they are 1st cousins and suffer a Generation Gap deduction, but are not such close relatives as to be removed from the results.
I had assumed that duplicate close relatives like this would soon be spotted by user inspection of family trees.
I don't understand why the siblings Lena BOUNDY and Harry BOUNDY have their parents separated by a blue clone ribbon.
Also the matching is inhibited by making Lyle Boundy SMITH a son of Harry BOUNDY i.e. the wrong Surname.
Nevertheless, now that immediate relatives are removed from the results, I have tweaked the deductions for remaining close relatives to only -15, -10, or -5.
I am beginning to wonder if the published Plugin needs preference options to tweak your own points, but that is for another day.
Posted: 06 Jul 2012 21:19
by TimTreeby
Hi Mike,
If it was just me then for (app) and (cal) dates I would probably go for: if just a year, +/- 5 years; if month and year, +/- 5 months; and if a full date, +/- 5 days.
Although I can't think of too many occasions where a full date would be entered as (app) or (cal), so I don't think that would matter too much, but that is just me.
Regarding the Boundys and the blue ribbon: the tree is drawn as two Ancestor diagrams for the two Lyle Boundy Smiths I have. They should be the same person, as he was born to William John Smith & Lena Boundy but then brought up as Harry Boundy's son. I am not sure if it was an official adoption or if he was just raised as his son.
Tim
Posted: 06 Jul 2012 21:36
by RogerF
Mike said: I am beginning to wonder if the published Plugin needs preference options to tweak your own points, but that is for another day.
Personally, I feel Preferences to be overkill; I suspect relatively few users will feel the need to tweak. What would be useful, for that minority, would be to have all of the scores defined as well-commented constants at the head of the Plugin, so that tweaking would be achieved by clearly-defined editing of the Plugin source.
Posted: 06 Jul 2012 22:55
by tatewise
Roger ~ Yes, that would be a good method, and SOOOOO much easier for me to implement!!
Tim ~ So my argument about close relatives like this being spotted as duplicates in the tree diagram is valid?
I presume you already knew about this duplication before using the Plugin?
If the Plugin misses such duplicates it is no great problem.
It is the more elusive duplicates that are important to find.
This case is also an example of the adoptive parents scenario discussed earlier, so the argument to never deduct points for mismatching Names is getting stronger.
Implementing the Approx/Calc dates should be straightforward, although it all adds a bit to the run time.
Posted: 07 Jul 2012 00:12
by BillH
Mike,
Thanks for implementing something for situations where the forenames are in different orders. I have a thought though. If the surname is the same and the forenames are in different orders, this would still allow 10 points. I would like to see forenames in different orders actually reduce the points. Just a thought.
I would still like to be able to deduct points or eliminate individuals who have one or both parents with mismatches, especially in the surname. I have very few (3 out of 9000) adoptions in my tree. Shouldn't the plugin handle the more common scenario in preference to the rarer one? Maybe as an option?
Thanks,
Bill
Posted: 07 Jul 2012 07:18
by Valkrider
Mike,
I want to re-read and fully understand your suggestions yesterday before I answer.
In the meantime I agree with Bill. Just because the forenames are the wrong way round, this should not deduct any points, as I have several instances of christening order being reversed in later life (particularly in census) but the correct way round on marriage certificates. I think if the two forenames are correct but in the wrong order nothing should be deducted.
Posted: 07 Jul 2012 13:44
by tatewise
Bill ~ I forgot to mention that the limit for Name matches will be raised to 20 points, as Name matches are now less likely to swamp the results due to all the other extra scoring compared with early versions of the Plugin.
Also the Individuals that mostly hit the limit were close family relations that are now excluded.
The scheme is consistent with the points for good Forename matches going up from 3 points to 6 points, and 7 points for Surnames.
So in effect, 3 points for a Forename in wrong position, is a deduction of 3 points.
There would have to be more than 4 such out of position Forenames, and a matching Surname, to hit the 20 point limit.
Bill said:
I would like to see forenames in different orders actually reduce the points.
Colin said:
Just because the forenames are the wrong way round this should not deduct any points.
To cope with this, I will implement Roger Frith's suggestion of values at the head of the Plugin script that users can edit as required.
Posted: 07 Jul 2012 17:07
by BillH
Mike,
That will work well I think.
Having the ability to change the values around will help. Each person can then make the values what works for them.
Will there be a way to deduct points for pairs of individuals where one or both of the parents' surnames don't match up?
Thanks,
Bill
Posted: 07 Jul 2012 18:15
by Valkrider
Mike
I have now had a think about your question of yesterday.
I am not so sure about Approximate (or Calculated) dates. What do others think?
How should I evaluate for example 4 July 1888 (Approx) or May 1777 (Approx) as opposed to 1666 (Approx)?
If I call them option 1, 2 and 3 in the order that you posed the question.
For 1: I would suggest that +/- one calendar month would get a high score, say 7; +/- 3 months would get, say, 3; and more than that 0.
For 2: I would suggest +/- 3 months would get a score of 7; +/- 6 months would get 3; and more than that 0.
For 3: I would suggest +/- 2 months would get a score of 7; +/- 18 months would get 3; and more than that 0.
These are my thoughts looking at my datasets and considering registration quarters. The only fly in the ointment may be the UK 1841 census where the ages were rounded and option 3 may need tweaking as a result.
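The three options above amount to a two-tier tolerance table, which could be sketched in Python as follows. The table layout, the function name and the idea of measuring the gap in months are assumptions for the example; the figures are those given in the post.

```python
# (months for a score of 7, months for a score of 3), keyed by how precise
# the approximate date is. Figures are those suggested in the post.
TOLERANCES = {
    "day": (1, 3),     # option 1, e.g. 4 July 1888 (Approx)
    "month": (3, 6),   # option 2, e.g. May 1777 (Approx)
    "year": (2, 18),   # option 3, e.g. 1666 (Approx)
}

def approx_date_score(precision: str, months_apart: float) -> int:
    """Return 7, 3 or 0 depending on how far apart the two dates are."""
    high, low = TOLERANCES[precision]
    gap = abs(months_apart)
    if gap <= high:
        return 7
    if gap <= low:
        return 3
    return 0
```

Keeping the figures in one table would make it easy to loosen option 3 if rounded census ages turn out to need it.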
Posted: 07 Jul 2012 21:29
by tatewise
In all these assessments, may I remind you all about a few techniques that the Plugin uses for ALL Event Dates, as described on the WiP help page.
Every Date is assigned a Timespan.
e.g.
4 July 1888 Timespan is 4 July 1888 to 4 July 1888 i.e. 1 day
May 1777 Timespan is 1 May 1777 to 31 May 1777 i.e. 1 month
1666 Timespan is 1 Jan 1666 to 31 Dec 1666 i.e. 1 year
Q1 1666 Timespan is 1 Jan 1666 to 31 Mar 1666 i.e. 3 months
Between 1660 & 1670 Timespan is 1 Jan 1660 to 31 Dec 1670 i.e. 11 years
After 1660 Timespan is 1 Jan 1660 to 31 Dec 1670 i.e. 11 years (in next Version)
When comparing two Dates the following applies.
If the two Start of Timespan Dates are less than 50 days apart then 2 points are awarded.
If the two End of Timespan Dates are less than 50 days apart then 2 more points are awarded.
If the two Timespans overlap at all then 2 extra points are awarded.
The 50 days was chosen to allow Dates such as 20 Dec 1665 or 20 Jan 1666 or Feb 1666 to score well against a Quarter Date such as Q1 1666.
So when it comes to dealing with Approximate/Calculated/Estimated Dates, there is already some tolerance built in, and on reflection, maybe only such Dates with no Day nor Month (i.e. Year only) need their Timespan increasing by say +/-5 years.
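For anyone who wants to experiment with the Timespan rules described above, here is a minimal Python sketch of them. It is an illustration only, not the Plugin's code: the helper names are invented, and Python's calendar arithmetic is assumed to be close enough for historical dates.

```python
import calendar
from datetime import date
from typing import Tuple

Span = Tuple[date, date]  # (start of Timespan, end of Timespan)

def day_span(year: int, month: int, day: int) -> Span:
    """A full date starts and ends on the same day."""
    d = date(year, month, day)
    return d, d

def month_span(year: int, month: int) -> Span:
    """A month-and-year date runs from the 1st to the last day of the month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)

def quarter_span(year: int, quarter: int) -> Span:
    """A Quarter Date such as Q1 1666 covers three whole months."""
    first_month = 3 * (quarter - 1) + 1
    return date(year, first_month, 1), month_span(year, first_month + 2)[1]

def compare_spans(a: Span, b: Span, near_days: int = 50) -> int:
    """Award 2 points each for close starts, close ends, and any overlap."""
    points = 0
    if abs((a[0] - b[0]).days) < near_days:
        points += 2
    if abs((a[1] - b[1]).days) < near_days:
        points += 2
    if a[0] <= b[1] and b[0] <= a[1]:
        points += 2
    return points
```

For example, `compare_spans(month_span(1666, 2), quarter_span(1666, 1))` earns all three awards, because Feb 1666 starts and ends within 50 days of the quarter's start and end and lies inside it.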
Posted: 07 Jul 2012 22:33
by LornaCraig
I wonder whether a timespan of 11 years is enough for the BEFORE and AFTER dates? In your example, after 1660 is interpreted as 1 Jan 1660 to 31 Dec 1670. If it is known that someone was alive in 1660 his/her death might be recorded as 'after 1660', but they could have lived until 1680. Where dates exist but do not match, 10 points are deducted. This could mean that when the individual is compared with someone whose death is recorded as 1680, 10 points are deducted!
On the other hand, perhaps if the dates exist but differ greatly (say by 50 years or more) then more than 10 points should be deducted, to outweigh the fact that the maximum number of possible points has increased a lot. If two people lived 100 years apart there is no question of them being duplicates.
Posted: 14 Jul 2012 17:27
by tatewise
The Find Duplicate Individuals Version 1.7 is now available for download.
It incorporates many of your excellent suggestions ~ see the WiP Help page for details.
I am sure you will let me know how it performs against your data.
In particular the Chronology checks are much more extensive and Synthesised time-span Dates are used where Real Event Dates are missing.
User Preference Settings exist at the head of the Plugin to allow you to experiment by editing the points scoring values, etc.
Posted: 15 Jul 2012 00:14
by BillH
Mike,
If I understand correctly, the only way to update the preferences is to actually edit the plugin source code. This means it would have to be re-done each time there is a new version. Any chance this could be put into an external dataset that we could update and it could be read into each successive version of the plugin?
Also, I'm not seeing how to change the value given or deducted when two individuals have a mismatch in father or mother or both. Am I just missing it?
Thanks,
Bill
Posted: 15 Jul 2012 00:41
by tatewise
An external dataset is a good idea, but starts to get tricky if the User Preference Settings change.
Why not keep your settings in a text file, and Copy & Paste into the script as necessary.
Once published, the Plugin won't change as often as it does now.
You can even Rename the Plugin by adding a suffix such as Bill, and it will still work OK.
Then you can Copy & Paste from your version into the next downloaded version, and Rename again.
Remember, the two Individuals, the two Mothers, Fathers, Spouses, and 1st Children are all compared in the same way, by matching their Names and Events.
So, the Names, Event, Dates, and Place Points are the ones to change.
e.g.
IntNamesDeduction is the deduction for a Name mismatch.
IntEventDeduction is the deduction for an Event mismatch.
None of the points are associated with any particular relative, except perhaps the Generation Gap and Gender scoring.
Posted: 15 Jul 2012 03:41
by BillH
Mike,
OK... thanks.
Bill
Posted: 15 Jul 2012 04:20
by BillH
Mike,
I have a follow up question.
How does IntNamesDeduction work? Does the entire name have to not match in order to get the deduction?
Thanks,
Bill
Posted: 15 Jul 2012 11:50
by LornaCraig
Mike,
V1.7 is looking good. The chronology checks are very sophisticated and you have obviously put a lot of work into it!
The ability to customise the points scoring makes the plugin very versatile. Two comments:
1. I note from your reply to Bill that when comparing names of family members, points are not associated with any particular relative. I don't know if it would be possible, but could the scoring for the matching of parents' names be handled separately from spouse and 1st child names? I don't want to deduct points for non-matching spouse or child names in case the same individual had two spouse families, but would like to deduct some points for non-matching parents' names (although I realise this may obscure cases of adoption or fostering). I think this is probably what Bill would want to do, from what he said in his earlier posts.
2. Would it be possible to separate the timespan for the 'After', 'Before', 'From', 'To' dates from the timespan for the 'Approximate'/'Calculated'/'Estimated' Year only Dates? (BTW: there is a minor typo: Extimated)
I am happy with the 10 year span for the latter group but would like to extend it to more than 10 years for the before and after dates. I sometimes record a death date as After xxx if I know the individual was still alive then (e.g. they witnessed a marriage or were informant for a death) but I have no trace of them after that. They could still have lived for several decades, perhaps on the other side of the world but maybe just in the next town, in a bigamous marriage! I don't think points should be deducted or a chronology check failed if they are compared with someone who was alive 20 years later. It would be helpful to be able to set the before/after date spans to as much as 30 or 40 years, while leaving the 10 year span for estimated/approximate/calculated dates.
Apart from these two minor enhancements it would be difficult to think of a way of improving what is now an excellent plugin. In my file the majority of positive scores are only 4%, mainly due to matching or similar names and nothing else, so the various event and chronology checks are working well.
Thanks for all your work on this.
Posted: 15 Jul 2012 18:20
by BillH
Mike,
I agree completely with Lorna. It would really be nice if we could deduct points for non-matching parents, but not for non-matching spouses and children.
Thanks,
Bill
Posted: 15 Jul 2012 19:08
by tatewise
Bill asked
How does IntNamesDeduction work. Does the entire name have to not match in order to get the deduction?
Yes, that is correct. If both Names exist, and their matches result in zero points, then the deduction is applied. You can influence this with values of IntLastNameScore, IntForeNameRight and IntSoundexNames, which can be zero, and IntForeNameWrong that can even be negative.
Lorna asked
Could the scoring for the matching of parents names be handled separately from spouse and 1st child names?
and Bill asked
It would really be nice if we could deduct points for non-matching parents, but not for non-matching spouses and children.
Yes OK, I will add separate IntNamesDeductIndi=0, IntNamesDeductFath=-5, IntNamesDeductMoth=-5, IntNamesDeductSpou=0, IntNamesDeductChld=0 in the next Version.
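As a rough picture of how separate deductions per relative could combine with the rule that a deduction applies only when both Names exist and score zero, here is a Python sketch. It is not the Plugin's code: the dictionary layout and function name are assumptions, and only the five values come from the post.

```python
# Deduction per compared pair, using the values proposed above.
NAMES_DEDUCT = {
    "Indi": 0,   # the two Individuals themselves
    "Fath": -5,  # their Fathers
    "Moth": -5,  # their Mothers
    "Spou": 0,   # their Spouses
    "Chld": 0,   # their 1st Children
}

def names_points(role: str, name_a: str, name_b: str, match_points: int) -> int:
    """Return the Name points for one compared pair.

    match_points is whatever the Name matching produced. The deduction
    replaces it only when both Names exist and nothing matched at all.
    """
    both_exist = bool(name_a) and bool(name_b)
    if both_exist and match_points == 0:
        return NAMES_DEDUCT[role]
    return match_points
```

Under these values, two Fathers with entirely different Names cost the pair 5 points, while two entirely different Spouses cost nothing, and a missing Name never triggers a deduction.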
Lorna asked
Would it be possible to separate the timespan for the 'After', 'Before', 'From', 'To' dates from the timespan for the 'Approximate'/'Calculated'/'Estimated' Year only Dates?
Yes OK, I will have IntDatesTimespan=50 and IntDatesVariance=5 for these two values respectively.
All these little tweaks are progressively slowing the Plugin down again, but is it still running fast enough, especially on the larger databases?
I have tried to incorporate many of the checks you have suggested to identify your Real Duplicates and eliminate the False Candidates.
Are there any Real Duplicates the Plugin is still NOT finding?
Are there any False Candidates the Plugin should be eliminating?
I still have the Non-Duplicates Management trick up my sleeve.
Posted: 15 Jul 2012 19:30
by johnmorrisoniom
I'm still getting quite a lot of 'Matches' with the same forenames but different Surnames and born in different places.
However the highest score I've got for anyone with 1.7 was 50 and most only score about 20 or lower.
I had one genuine duplicate with identical names, and DOB within 1 year, that only got a score of 13, probably because there was no place info for one of them. Even the parents matched up.
Posted: 15 Jul 2012 20:03
by tatewise
John ~ Could you post the details of a few of the highest scoring False Candidates, and details of the low scoring Real Duplicates that should be scoring more than 13 if the Individual Names and both Parents match, unless there are also some mismatching Events.
With regard to your Save Result Set Wish List Request, I will add an option to re-display the previous Result Set.
Posted: 15 Jul 2012 20:51
by BillH
Mike,
I'm still getting a lot of non-duplicates on my report. Many of these are high on my report mixed in with the actual duplicates. Most of these would be eliminated or dropped way down if the mis-matched parents' names didn't result in so many points.
Yes, that is correct. If both Names exist, and their matches result in zero points, then the deduction is applied. You can influence this with values of IntLastNameScore, IntForeNameRight and IntSoundexNames, which can be zero, and IntForeNameWrong that can even be negative.
Yes OK, I will add separate IntNamesDeductIndi=0, IntNamesDeductFath=-5, IntNamesDeductMoth=-5, IntNamesDeductSpou=0, IntNamesDeductChld=0 in the next Version.
If the new options you are adding work the same way, then will it really help all that much?
Since IntLastNameScore, IntForeNameRight, IntSoundexNames and IntForeNameWrong apply to all name matches, if I set them to zero won't that also impact matching of names for the individuals themselves and for their spouses and children?
I would really like to leave name matching alone for the individuals, their spouses, and children and only impact mothers and fathers names if possible.
Ideally, if the mothers' and fathers' surnames are not an exact match, I would deduct so many points that the individuals ended up with no points or negative points.
If the surnames were an exact match, then I would deduct some points if the forenames did not match.
What might work is options something like this:
IntNamesDeductFathSurname
IntNamesDeductFathForenames
IntNamesDeductMothSurname
IntNamesDeductMothForenames
Rather than these working only if the names are a total mismatch, they would work if the names were not a total match. So... we would be able to deduct points if one mother was named Julia Ann Snodgrass and one was named Ann Marie Henshaw for example. The common Ann should not prevent points from being deducted.
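To make the request concrete, the behaviour described could be sketched in Python like this. It is purely illustrative: the function name is invented, and the deduction values are hypothetical placeholders rather than anything the Plugin defines.

```python
from typing import List

def parent_deduction(forenames_a: List[str], surname_a: str,
                     forenames_b: List[str], surname_b: str,
                     surname_deduct: int, forenames_deduct: int) -> int:
    """Deduct whenever the two parents' names are not a total match.

    A surname that is not an exact match takes the (large) surname deduction.
    If the surnames agree but the forenames differ in any way, the (smaller)
    forenames deduction applies. A shared forename does not prevent it.
    """
    if surname_a.lower() != surname_b.lower():
        return surname_deduct
    if [f.lower() for f in forenames_a] != [f.lower() for f in forenames_b]:
        return forenames_deduct
    return 0

# Hypothetical values for the proposed options.
IntNamesDeductMothSurname = -50
IntNamesDeductMothForenames = -5
```

With these placeholder values, Julia Ann Snodgrass against Ann Marie Henshaw takes the surname deduction even though both contain Ann.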
Would something like that be possible?
Thanks,
Bill