* Find Duplicate Individuals Version 1.5+
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
Find Duplicate Individuals Version 1.5+
Yes.
Assume originally Tom Bodles (A) matched Tom Bodles (B) in the Result Set and the pair were placed in the Non-Duplicates list.
Then you add Tom Bodles (C), and assuming enough details match then the Result Set should include:
Tom Bodles (A) matches Tom Bodles (C)
Tom Bodles (B) matches Tom Bodles (C)
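The behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the Plugin's own code; the function name `candidate_pairs` and the use of a set of pairs for the Non-Duplicates list are assumptions.

```python
from itertools import combinations

def candidate_pairs(individuals, non_duplicates):
    """Yield every unordered pair that is not already in the Non-Duplicates list."""
    for a, b in combinations(individuals, 2):
        # A frozenset makes (A, B) and (B, A) the same pair.
        if frozenset((a, b)) not in non_duplicates:
            yield (a, b)

# Hypothetical data: A and B were previously marked as non-duplicates.
non_duplicates = {frozenset(("Tom Bodles (A)", "Tom Bodles (B)"))}
people = ["Tom Bodles (A)", "Tom Bodles (B)", "Tom Bodles (C)"]
pairs = list(candidate_pairs(people, non_duplicates))
```

Only the A/B pair is suppressed; the newly added C is still compared with both A and B.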
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- LornaCraig
- Megastar
- Posts: 2989
- Joined: 11 Jan 2005 17:36
- Family Historian: V7
- Location: Oxfordshire, UK
Find Duplicate Individuals Version 1.5+
Mike,
The chronology checks in V1.6 seem to be working fairly well but could probably still be extended. The plugin is still trying to match (admittedly with a low score) an individual whose children were baptised in the 1640s with an individual who was born in 1945. The date of baptism of the first individual's children could be used to assign him a birth date before 1640, but no points have been deducted for a non-matching date.
It might be a good idea to assume a lifespan of no more than 100 years (with apologies to anyone who has very long-lived ancestors). This would mean that if a birth date is known but there is no death date, the plugin would not try to match the individual with someone who has some life events more than 100 years later. Conversely, if a death date is known but no birth date, it would not try to match the individual with someone alive more than 100 years earlier.
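The 100-year rule suggested above might look something like this. It is a sketch in Python under stated assumptions, not how the Plugin does it; the names `plausible_life_window` and `could_be_same_person` are made up, and only whole years are considered.

```python
MAX_LIFESPAN_YEARS = 100  # assumed cap, as proposed in the post

def plausible_life_window(birth_year=None, death_year=None):
    """Return (earliest, latest) year in which the person could have been alive."""
    if birth_year is not None and death_year is not None:
        return (birth_year, death_year)
    if birth_year is not None:
        return (birth_year, birth_year + MAX_LIFESPAN_YEARS)
    if death_year is not None:
        return (death_year - MAX_LIFESPAN_YEARS, death_year)
    return None  # nothing known, so nothing can be ruled out

def could_be_same_person(window_a, window_b):
    """Two records stay candidates only if their life windows overlap."""
    if window_a is None or window_b is None:
        return True
    return window_a[0] <= window_b[1] and window_b[0] <= window_a[1]

# Hypothetical example: a father assigned a birth year of 1620 from his
# children's 1640s baptisms, compared with someone born in 1945.
father = plausible_life_window(birth_year=1620)
modern = plausible_life_window(birth_year=1945)
```

With these windows the pair would never be compared, instead of merely scoring low.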
On the question of deducting points for non-matching relatives' names, as you said in an earlier post -10 or -20 points is now less than 10% of high scoring duplicates. However, I suspect that in practice very few duplicates will get scores higher than about 50 and some may get as little as 10. If they scored as much as 200 they would have so much in common that they would probably already have been noticed and the records merged. Some genuine duplicates will inevitably get low scores. There may not be many positive matches in the data, perhaps because the data for one of them is scarce or, crucially, because an individual with two spouse families has been entered as two different people. It is the job of the plugin to show that the records are compatible even if they have a low score. Deducting points for non-matching spouse or child names could wipe out the low score. I am still inclined to think that it is enough to add points for matching spouse and child names, without deducting points for non-matches.
Lorna
- TimTreeby
- Famous
- Posts: 168
- Joined: 12 Sep 2003 14:56
- Family Historian: V6.2
- Location: Ogwell, Devon
Find Duplicate Individuals Version 1.5+
Mike,
I have found a slight problem with your date checking algorithm. I believe this is due to you treating dates as ranges but not considering (bef), (aft) and (app) dates.
I.e. in your explanatory notes you say:
Any single Date is treated similarly, so 1 Feb 1777 starts & ends on 1 Feb 1777, whereas 1666 starts 1 Jan 1666 and ends 31 Dec 1666.
This, I think, then leads to a date of (bef) 1799 being treated as 1 Jan 1799 to 31 Dec 1799
and a date of 1796 (app) being treated as 1 Jan 1796 to 31 Dec 1796. This would lead to NO MATCHES, and therefore deducting points and missing possible matches or reducing the score for probable matches, unless I have misunderstood how the date matching works.
If it is possible, I would suggest the following, which would overcome this:
(bef) : set range from (year-10 years) to year
(aft) : set range from year to (year+10 years)
(app) : set range from (year-5 years) to (year+5 years)
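The three suggested ranges could be sketched as below. This is Python for illustration only and is not the Plugin's code; the function names are invented, and it is an assumption that each range runs from 1 Jan of its first year to 31 Dec of its last year.

```python
from datetime import date

BEFORE_AFTER_YEARS = 10  # span suggested in the post for (bef) and (aft)
APPROX_YEARS = 5         # +/- tolerance suggested in the post for (app)

def year_range(year, qualifier=None):
    """Return (start, end) dates for a year-only date with an optional qualifier."""
    if qualifier == "bef":
        return date(year - BEFORE_AFTER_YEARS, 1, 1), date(year, 12, 31)
    if qualifier == "aft":
        return date(year, 1, 1), date(year + BEFORE_AFTER_YEARS, 12, 31)
    if qualifier == "app":
        return date(year - APPROX_YEARS, 1, 1), date(year + APPROX_YEARS, 12, 31)
    return date(year, 1, 1), date(year, 12, 31)  # plain year: that year only

def overlaps(a, b):
    """True if two (start, end) ranges share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]

# The example from the post: (bef) 1799 against 1796 (app).
ignoring_qualifiers = overlaps(year_range(1799), year_range(1796))
using_qualifiers = overlaps(year_range(1799, "bef"), year_range(1796, "app"))
```

Ignoring the qualifiers, the two dates never overlap; with the suggested ranges they do.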
I have a Gedcom file where I can prove this to be the case if you want a test file; I have extracted just the Boundys as shown in the diagram in my previous post. If I change the dates around then the Elizabeth Hancocks get a much higher score, and then the John Hancocks show as matches as well as Jeneffe Boundy to Jenefee (Hancock).
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
Find Duplicate Individuals Version 1.5+
Tim ~ Progress on the Plugin is a bit slow at present, because of my other commitments.
However, your conclusions are absolutely correct, and I had already created a few individuals based on your tree diagrams above, and implemented the Before/After (and From/To) exactly as you propose.
I was not sure whether to use +/-10 years, or +/-100 years, or even as near infinity as FH would allow, but you have plumped for the same as me.
This does result in increased scores for the Hancocks ~ Elizabeth HANCOCK gets 46 points, and John gets 30.
I am not so sure about Approximate (or Calculated) dates. What do others think?
How should I evaluate for example 4 July 1888 (Approx) or May 1777 (Approx) as opposed to 1666 (Approx)?
On another topic suggested by Bill, I have revisited Forename checking and implemented a scheme to take into account their position.
I won't go into the algorithm details, but the Name scoring is as follows and easily tweak-able:
7 points for SURNAME match.
6 points for Forename match in correct position i.e. both 1st, or both 2nd, etc.
3 points for Forename match but different position.
2 points for Soundex match, excluding all the above.
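The four point values above could be applied along these lines. This is a Python sketch only and not the Plugin's algorithm (which is not described here); the matching order, the function names, and the simplified Soundex (H and W treated like vowels) are all assumptions.

```python
SURNAME_MATCH = 7
FORENAME_SAME_POSITION = 6
FORENAME_OTHER_POSITION = 3
SOUNDEX_MATCH = 2

_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {c: digit for digit, letters in _GROUPS.items() for c in letters}

def soundex(name):
    """Simplified Soundex: first letter plus up to three consonant codes."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out, prev = letters[0], _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def name_score(forenames_a, surname_a, forenames_b, surname_b):
    """Score two names: exact matches first, then Soundex for what is left."""
    score = 0
    if surname_a.upper() == surname_b.upper():
        score += SURNAME_MATCH
    elif soundex(surname_a) == soundex(surname_b):
        score += SOUNDEX_MATCH
    a = [f.upper() for f in forenames_a]
    b = [f.upper() for f in forenames_b]
    free_b = set(range(len(b)))  # positions in b not yet matched
    leftover = []
    for i, name in enumerate(a):  # pass 1: same position
        if i in free_b and b[i] == name:
            score += FORENAME_SAME_POSITION
            free_b.discard(i)
        else:
            leftover.append(name)
    unmatched = []
    for name in leftover:  # pass 2: same forename, different position
        j = next((j for j in sorted(free_b) if b[j] == name), None)
        if j is None:
            unmatched.append(name)
        else:
            score += FORENAME_OTHER_POSITION
            free_b.discard(j)
    for name in unmatched:  # pass 3: Soundex, excluding all the above
        j = next((j for j in sorted(free_b) if soundex(b[j]) == soundex(name)), None)
        if j is not None:
            score += SOUNDEX_MATCH
            free_b.discard(j)
    return score

swapped = name_score(["John", "James"], "SMITH", ["James", "John"], "SMITH")
identical = name_score(["John", "James"], "SMITH", ["John", "James"], "SMITH")
```

In this sketch the swapped pair scores 7 + 3 + 3 and the identical pair 7 + 6 + 6.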
Knowing this, if users suspect an individual with multiple Forenames is a duplicate, then creating Alternate Names with the Forenames in different positions may find an elusive match, i.e. John James SMITH as well as James John SMITH.
Tim ~ This new scheme boosts the score for the Lyle Boundy SMITH pair to 19 points together with the other tweak below.
The biggest problem with this pair is that they are 1st cousins and suffer a Generation Gap deduction, but are not such close relatives as to be removed from the results.
I had assumed that duplicate close relatives like this would soon be spotted by user inspection of family trees.
I don't understand why the siblings Lena BOUNDY and Harry BOUNDY have their parents separated by a blue clone ribbon.
Also the matching is inhibited by making Lyle Boundy SMITH a son of Harry BOUNDY i.e. the wrong Surname.
Nevertheless, now that immediate relatives are removed from the results, I have tweaked the deductions for remaining close relatives to only -15, -10, or -5.
I am beginning to wonder if the published Plugin needs preference options to tweak your own points, but that is for another day.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- TimTreeby
- Famous
- Posts: 168
- Joined: 12 Sep 2003 14:56
- Family Historian: V6.2
- Location: Ogwell, Devon
Find Duplicate Individuals Version 1.5+
Hi Mike,
If it was just me then for (app) and (cal) dates I would probably go for +/- 5 years if just a year, +/- 5 months if a month and year, and +/- 5 days if a full date.
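That precision-based tolerance could be sketched as follows. It is illustrative Python rather than anything from the Plugin; the helper names are invented, and it is an assumption that a widened month or year range still starts on the first day and ends on the last day of its period.

```python
import calendar
from datetime import date, timedelta

def shift_months(year, month, delta):
    """Move a (year, month) pair by delta months."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1

def approx_timespan(year, month=None, day=None, tolerance=5):
    """Widen an approximate date by +/- tolerance units of its own precision."""
    if month is None:  # year only: +/- years
        return date(year - tolerance, 1, 1), date(year + tolerance, 12, 31)
    if day is None:    # month and year: +/- months
        y0, m0 = shift_months(year, month, -tolerance)
        y1, m1 = shift_months(year, month, tolerance)
        last_day = calendar.monthrange(y1, m1)[1]
        return date(y0, m0, 1), date(y1, m1, last_day)
    exact = date(year, month, day)  # full date: +/- days
    return exact - timedelta(days=tolerance), exact + timedelta(days=tolerance)

year_only = approx_timespan(1666)
month_year = approx_timespan(1777, 5)
full_date = approx_timespan(1888, 7, 4)
```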
Although I can't think of too many occasions where a full date would then be put in as (app) or (cal), so I don't think that would matter too much, but that is just me.
Regarding the Boundys and the blue ribbon: this is because the tree is drawn as two Ancestor diagrams of the two Lyle Boundy Smiths I have. They should be the same person, as he was born to William John Smith & Lena Boundy but then brought up as Harry Boundy's son. I am not sure if it was an official adoption or if he was just raised as his son.
Tim
- RogerF
- Famous
- Posts: 182
- Joined: 26 Apr 2009 16:32
- Family Historian: V6.2
- Location: Oxfordshire, England
Find Duplicate Individuals Version 1.5+
Mike said:
I am beginning to wonder if the published Plugin needs preference options to tweak your own points, but that is for another day.
Personally, I feel Preferences to be overkill; I suspect relatively few users will feel the need to tweak. What would be useful, for that minority, would be to have all of the scores defined as well-commented constants at the head of the Plugin, so that tweaking would be achieved by clearly-defined editing of the Plugin source.
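As a rough picture of what such a block of constants might look like, here is a sketch in Python. The constant names are hypothetical and are not taken from the Plugin; the values are the Name scoring figures quoted earlier in the thread.

```python
# ---------------------------------------------------------------
# User-editable scoring values (illustrative names only).
# Edit the numbers below, save, and run again.
# ---------------------------------------------------------------
SURNAME_MATCH_POINTS = 7             # surname matches exactly
FORENAME_SAME_POSITION_POINTS = 6    # forename matches in the same position
FORENAME_OTHER_POSITION_POINTS = 3   # forename matches in a different position
SOUNDEX_MATCH_POINTS = 2             # Soundex match, excluding all the above
```

Keeping every tweakable value in one commented block means nobody has to search the scoring logic to change a number.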
Roger Firth, using FH to research the FIRTHs of Lancashire and Yorkshire, and the residents of the market town where I live.
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
Find Duplicate Individuals Version 1.5+
Roger ~ Yes, that would be a good method, and SOOOOO much easier for me to implement!!
Tim ~ So my argument about close relatives like this being spotted as duplicates in the tree diagram is valid?
I presume you already knew about this duplication before using the Plugin?
If the Plugin misses such duplicates it is no great problem.
It is the more elusive duplicates that are important to find.
This case is also an example of the adoptive parents scenario discussed earlier, so the argument to never deduct points for mismatching Names is getting stronger.
Implementing the Approx/Calc dates should be straightforward, although it all adds a bit to the run time.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- BillH
- Megastar
- Posts: 2179
- Joined: 31 May 2010 03:40
- Family Historian: V7
- Location: Washington State, USA
Find Duplicate Individuals Version 1.5+
Mike,
Thanks for implementing something for situations where the forenames are in different orders. I have a thought though. If the surname is the same and the forenames are in different orders, this would still allow 10 points. I would like to see forenames in different orders actually reduce the points. Just a thought.
I would still like to be able to deduct points or eliminate individuals who have one or both parents with mismatches, especially in the surname. I have very few adoptions (3 out of 9000) in my tree. Shouldn't the plugin handle the more common scenario rather than the rarer one? Maybe as an option?
Thanks,
Bill
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
Find Duplicate Individuals Version 1.5+
Mike,
I want to re-read and fully understand your suggestions yesterday before I answer.
In between times, I agree with Bill. Just because the forenames are the wrong way round, this should not deduct any points, as I have several instances of christening order being reversed in later life (particularly in census) but the correct way round on marriage certificates. I think if the two forenames are correct but in the wrong order nothing should be deducted.
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
Find Duplicate Individuals Version 1.5+
Bill ~ I forgot to mention that the limit for Name matches will be raised to 20 points, as Name matches are now less likely to swamp the results due to all the other extra scoring compared with early versions of the Plugin.
Also the Individuals that mostly hit the limit were close family relations that are now excluded.
The scheme is consistent with the points for good Forename matches going up from 3 points to 6 points, and 7 points for Surnames.
So in effect, 3 points for a Forename in wrong position, is a deduction of 3 points.
There would have to be more than 4 such out of position Forenames, and a matching Surname, to hit the 20 point limit.
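The arithmetic behind that last statement can be checked with a few lines of Python (the variable names are only for illustration):

```python
SURNAME_POINTS = 7
OUT_OF_POSITION_POINTS = 3
NAME_LIMIT = 20

# Count how many out-of-position Forenames are needed, on top of a
# matching Surname, before the total reaches the limit.
needed = 0
while SURNAME_POINTS + OUT_OF_POSITION_POINTS * needed < NAME_LIMIT:
    needed += 1
```

Four such Forenames give 7 + 12 = 19, still under the limit, so a fifth is needed.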
Bill said:
I would like to see forenames in different orders actually reduce the points.
Colin said:
Just because the forenames are the wrong way round this should not deduct any points.
To cope with this, I will implement Roger Firth's suggestion of values at the head of the Plugin script that users can edit as required.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- BillH
- Megastar
- Posts: 2179
- Joined: 31 May 2010 03:40
- Family Historian: V7
- Location: Washington State, USA
Find Duplicate Individuals Version 1.5+
Mike,
That will work well I think.
Having the ability to change the values around will help. Each person can then make the values what works for them.
Will there be a way to deduct points for pairs of individuals where one or both of the parents' surnames don't match up?
Thanks,
Bill
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
Find Duplicate Individuals Version 1.5+
Mike
I have now had a think about your question of yesterday.
Mike said:
I am not so sure about Approximate (or Calculated) dates. What do others think?
How should I evaluate for example 4 July 1888 (Approx) or May 1777 (Approx) as opposed to 1666 (Approx)?
I will call them options 1, 2 and 3 in the order that you posed the question.
For 1: I would suggest that +/- one calendar month would get a high score, say 7; +/- 3 months would get, say, 3; and more than that 0.
For 2: I would suggest +/- 3 months would get a score of 7; +/- 6 months would get 3; and more than that 0.
For 3: I would suggest +/- 2 months would get a score of 7; +/- 18 months would get 3; and more than that 0.
These are my thoughts looking at my datasets and considering registration quarters. The only fly in the ointment may be the UK 1841 census where the ages were rounded, and option 3 may need tweaking as a result.
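The three tiers suggested above could be tabulated like this. It is a Python sketch only, not Plugin code; the names are invented, the gap between the two dates is assumed to be already expressed in months, and the thresholds are copied as posted.

```python
# (maximum gap in months, points) for each precision of an approximate date
TIERS = {
    "day":   [(1, 7), (3, 3)],    # option 1, e.g. 4 July 1888 (Approx)
    "month": [(3, 7), (6, 3)],    # option 2, e.g. May 1777 (Approx)
    "year":  [(2, 7), (18, 3)],   # option 3, e.g. 1666 (Approx)
}

def approx_date_score(precision, months_apart):
    """Return the points for two dates that are months_apart months apart."""
    for limit, points in TIERS[precision]:
        if abs(months_apart) <= limit:
            return points
    return 0
```

Holding the tiers in one table would make the 1841 census tweak a one-line change.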
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
Find Duplicate Individuals Version 1.5+
In all these assessments, may I remind you all about a few techniques that the Plugin uses for ALL Event Dates, as described on the WiP help page.
Every Date is assigned a Timespan.
e.g.
4 July 1888 Timespan is 4 July 1888 to 4 July 1888 i.e. 1 day
May 1777 Timespan is 1 May 1777 to 31 May 1777 i.e. 1 month
1666 Timespan is 1 Jan 1666 to 31 Dec 1666 i.e. 1 year
Q1 1666 Timespan is 1 Jan 1666 to 31 Mar 1666 i.e. 3 months
Between 1660 & 1670 Timespan is 1 Jan 1660 to 31 Dec 1670 i.e. 11 years
After 1660 Timespan is 1 Jan 1660 to 31 Dec 1670 i.e. 11 years (in next Version)
When comparing two Dates the following applies.
If the two Start of Timespan Dates are less than 50 days apart then 2 points are awarded.
If the two End of Timespan Dates are less than 50 days apart then 2 more points are awarded.
If the two Timespans overlap at all then 2 extra points are awarded.
The 50 days was chosen to allow Dates such as 20 Dec 1665 or 20 Jan 1666 or Feb 1666 to score well against a Quarter Date such as Q1 1666.
So when it comes to dealing with Approximate/Calculated/Estimated Dates, there is already some tolerance built in, and on reflection, maybe only such Dates with no Day nor Month (i.e. Year only) need their Timespan increasing by say +/-5 years.
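The three comparison rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Plugin's code: the function name is invented, and a Timespan is represented simply as a (start, end) pair of dates.

```python
from datetime import date

NEAR_DAYS = 50   # "less than 50 days apart"
POINTS = 2       # points awarded for each rule that is satisfied

def date_points(span_a, span_b):
    """Score two (start, end) Timespans using the three rules."""
    (start_a, end_a), (start_b, end_b) = span_a, span_b
    score = 0
    if abs((start_a - start_b).days) < NEAR_DAYS:
        score += POINTS  # starts are close
    if abs((end_a - end_b).days) < NEAR_DAYS:
        score += POINTS  # ends are close
    if start_a <= end_b and start_b <= end_a:
        score += POINTS  # Timespans overlap at all
    return score

q1_1666 = (date(1666, 1, 1), date(1666, 3, 31))
dec_20 = (date(1665, 12, 20), date(1665, 12, 20))
jan_20 = (date(1666, 1, 20), date(1666, 1, 20))
feb_1666 = (date(1666, 2, 1), date(1666, 2, 28))
```

Against Q1 1666, 20 Dec 1665 earns points for a close start only, 20 Jan 1666 for a close start plus overlap, and Feb 1666 for all three rules.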
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- LornaCraig
- Megastar
- Posts: 2989
- Joined: 11 Jan 2005 17:36
- Family Historian: V7
- Location: Oxfordshire, UK
Find Duplicate Individuals Version 1.5+
I wonder whether a timespan of 11 years is enough for the BEFORE and AFTER dates? In your example, after 1660 is interpreted as 1 Jan 1660 to 31 Dec 1670. If it is known that someone was alive in 1660 his/her death might be recorded as 'after 1660', but they could have lived until 1680. Where dates exist but do not match, 10 points are deducted. This could mean that when the individual is compared with someone whose death is recorded as 1680, 10 points are deducted!
On the other hand, perhaps if the dates exist but differ greatly (say by 50 years or more) then more than 10 points should be deducted, to outweigh the fact that the maximum number of possible points has increased a lot. If two people lived 100 years apart there is no question of them being duplicates.
Lorna
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
Find Duplicate Individuals Version 1.5+
The Find Duplicate Individuals Version 1.7 is now available for download.
It incorporates many of your excellent suggestions ~ see the WiP Help page for details.
I am sure you will let me know how it performs against your data.
In particular the Chronology checks are much more extensive and Synthesised time-span Dates are used where Real Event Dates are missing.
User Preference Settings exist at the head of the Plugin to allow you to experiment by editing the points scoring values, etc.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- BillH
- Megastar
- Posts: 2179
- Joined: 31 May 2010 03:40
- Family Historian: V7
- Location: Washington State, USA
Find Duplicate Individuals Version 1.5+
Mike,
If I understand correctly, the only way to update the preferences is to actually edit the plugin source code. This means it would have to be re-done each time there is a new version. Any chance this could be put into an external dataset that we could update and it could be read into each successive version of the plugin?
Also, I'm not seeing how to change the value given or deducted when two individuals have a mismatch in father or mother or both. Am I just missing it?
Thanks,
Bill
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
Find Duplicate Individuals Version 1.5+
An external dataset is a good idea, but starts to get tricky if the User Preference Settings change.
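One way an external settings file might cope with settings that change between versions is to start from the built-in defaults and ignore anything it does not recognise. The Python sketch below only illustrates that idea and is not a Plugin feature; the `key = value` file format, the function name, and the default values of 10 are all assumptions.

```python
# Placeholder defaults; the real Plugin values may differ.
DEFAULTS = {"IntNamesDeduction": 10, "IntEventDeduction": 10}

def apply_overrides(defaults, text):
    """Return (settings, ignored_lines) after applying 'key = value' lines."""
    settings = dict(defaults)
    ignored = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        key, sep, value = line.partition("=")
        key = key.strip()
        try:
            if not sep or key not in settings:
                raise ValueError("unknown or malformed setting")
            settings[key] = int(value.strip())
        except ValueError:
            ignored.append(raw)  # e.g. a setting dropped in a newer version
    return settings, ignored

saved = "IntNamesDeduction = 5\n# my notes\nOldSetting = 3\n"
settings, ignored = apply_overrides(DEFAULTS, saved)
```

Settings missing from the file keep their defaults, and obsolete ones are reported rather than causing an error.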
Why not keep your settings in a text file, and Copy & Paste into the script as necessary?
Once published, the Plugin won't change as often as it does now.
You can even Rename the Plugin by adding a suffix such as Bill, and it will still work OK.
Then you can Copy & Paste from your version into the next downloaded version, and Rename again.
Remember, the two Individuals, the two Mothers, Fathers, Spouses, and 1st Children are all compared in the same way, by matching their Names and Events.
So, the Names, Event, Dates, and Place Points are the ones to change.
e.g.
IntNamesDeduction is the deduction for a Name mismatch.
IntEventDeduction is the deduction for an Event mismatch.
None of the points are associated with any particular relative, except perhaps the Generation Gap and Gender scoring.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- BillH
- Megastar
- Posts: 2179
- Joined: 31 May 2010 03:40
- Family Historian: V7
- Location: Washington State, USA
Find Duplicate Individuals Version 1.5+
Mike,
OK... thanks.
Bill
- BillH
- Megastar
- Posts: 2179
- Joined: 31 May 2010 03:40
- Family Historian: V7
- Location: Washington State, USA
Find Duplicate Individuals Version 1.5+
Mike,
I have a follow up question.
How does IntNamesDeduction work? Does the entire name have to not match in order to get the deduction?
Thanks,
Bill
- LornaCraig
- Megastar
- Posts: 2989
- Joined: 11 Jan 2005 17:36
- Family Historian: V7
- Location: Oxfordshire, UK
Find Duplicate Individuals Version 1.5+
Mike,
V1.7 is looking good. The chronology checks are very sophisticated and you have obviously put a lot of work into it!
The ability to customise the points scoring makes the plugin very versatile. Two comments:
1. I note from your reply to Bill that when comparing names of family members, points are not associated with any particular relative. I don't know if it would be possible, but could the scoring for the matching of parents' names be handled separately from spouse and 1st child names? I don't want to deduct points for non-matching spouse or child names in case the same individual had two spouse families, but would like to deduct some points for non-matching parents' names (although I realise this may obscure cases of adoption or fostering). I think this is probably what Bill would want to do, from what he said in his earlier posts.
2. Would it be possible to separate the timespan for the 'After', 'Before', 'From', 'To' dates from the timespan for the 'Approximate'/'Calculated'/'Estimated' Year only Dates? (BTW: there is a minor typo: Extimated)
I am happy with the 10 year span for the latter group but would like to extend it to more than 10 years for the before and after dates. I sometimes record a death date as After xxx if I know the individual was still alive then (e.g. they witnessed a marriage or were informant for a death) but I have no trace of them after that. They could still have lived for several decades, perhaps on the other side of the world but maybe just in the next town, in a bigamous marriage! I don't think points should be deducted or a chronology check failed if they are compared with someone who was alive 20 years later. It would be helpful to be able to set the before/after date spans to as much as 30 or 40 years, while leaving the 10 year span for estimated/approximate/calculated dates.
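To make the request in point 2 concrete, here is a rough Python sketch of two separate tolerances, one for range-style dates and one for approximate dates. It is not the plugin's code: the constant and function names are hypothetical, the 40 and 10 year values are simply the figures suggested above, and the comparison ignores the direction of 'After' and 'Before' for simplicity.

```python
# Hypothetical tolerances in years; names and values are illustrative only.
RANGE_SPAN_YEARS = 40   # 'After', 'Before', 'From', 'To' dates
APPROX_SPAN_YEARS = 10  # 'Approximate', 'Calculated', 'Estimated' dates

RANGE_QUALIFIERS = {"after", "before", "from", "to"}
APPROX_QUALIFIERS = {"approximate", "calculated", "estimated"}


def tolerance_years(qualifier):
    """Return the allowed year difference for one date qualifier."""
    q = qualifier.strip().lower()
    if q in RANGE_QUALIFIERS:
        return RANGE_SPAN_YEARS
    if q in APPROX_QUALIFIERS:
        return APPROX_SPAN_YEARS
    return 0  # assumption: an unqualified date must match exactly


def years_compatible(year_a, qualifier_a, year_b, qualifier_b=""):
    """True if two years differ by no more than the wider of the two tolerances.

    Simplification: the direction of 'After'/'Before' is not considered.
    """
    allowed = max(tolerance_years(qualifier_a), tolerance_years(qualifier_b))
    return abs(year_a - year_b) <= allowed
```

With these assumed values, a death recorded as After 1850 stays compatible with someone alive in 1870, whereas an Estimated 1850 date does not.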
Apart from these two minor enhancements it would be difficult to think of a way of improving what is now an excellent plugin. In my file the majority of positive scores are only 4%, mainly due to matching or similar names and nothing else, so the various event and chronology checks are working well.
Thanks for all your work on this.
Lorna
- BillH
- Megastar
- Posts: 2179
- Joined: 31 May 2010 03:40
- Family Historian: V7
- Location: Washington State, USA
Mike,
I agree completely with Lorna. It would really be nice if we could deduct points for non-matching parents, but not for non-matching spouses and children.
Thanks,
Bill
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
Bill asked:
"How does IntNamesDeduction work? Does the entire name have to not match in order to get the deduction?"
Yes, that is correct. If both Names exist, and their matches result in zero points, then the deduction is applied. You can influence this with values of IntLastNameScore, IntForeNameRight and IntSoundexNames, which can be zero, and IntForeNameWrong, which can even be negative.
Lorna asked:
"Could the scoring for the matching of parents names be handled separately from spouse and 1st child names?"
and Bill asked:
"It would really be nice if we could deduct points for non-matching parents, but not for non-matching spouses and children."
Yes OK, I will add separate IntNamesDeductIndi=0, IntNamesDeductFath=-5, IntNamesDeductMoth=-5, IntNamesDeductSpou=0, IntNamesDeductChld=0 in the next Version.
Lorna asked:
"Would it be possible to separate the timespan for the 'After', 'Before', 'From', 'To' dates from the timespan for the 'Approximate'/'Calculated'/'Estimated' Year only Dates?"
Yes OK, I will have IntDatesTimespan=50 and IntDatesVariance=5 for these two values respectively.
All these little tweaks are progressively slowing the Plugin down again, but is it still running fast enough, especially on the larger databases?
I have tried to incorporate many of the checks you have suggested to identify your Real Duplicates and eliminate the False Candidates.
Are there any Real Duplicates the Plugin is still NOT finding?
Are there any False Candidates the Plugin should be eliminating?
I still have the Non-Duplicates Management trick up my sleeve.
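For anyone trying to picture the deduction rule, the following is a minimal Python sketch and not the Plugin's actual code. It assumes the name comparison has already produced a points total (passed in as the hypothetical match_points argument) and that a missing name contributes nothing. The per-relative values are the ones proposed for the next Version; the function and dictionary names are made up for illustration.

```python
# Per-relative deductions, using the values proposed for the next Version.
NAMES_DEDUCT = {
    "Indi": 0,   # the individual
    "Fath": -5,  # father
    "Moth": -5,  # mother
    "Spou": 0,   # spouse
    "Chld": 0,   # 1st child
}


def names_points(relative, name_a, name_b, match_points):
    """Points contributed by comparing one pair of names.

    match_points is whatever the name comparison awarded (hypothetical input;
    the real comparison is not reproduced here).
    """
    if not name_a or not name_b:
        # Assumption: the deduction needs both names to exist.
        return 0
    if match_points == 0:
        # Both names exist but scored nothing, so apply the deduction.
        return NAMES_DEDUCT[relative]
    return match_points
```

With these values, a pair of fathers whose names score zero costs 5 points, while a pair of spouses whose names score zero costs nothing.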
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- johnmorrisoniom
- Megastar
- Posts: 882
- Joined: 18 Dec 2008 07:40
- Family Historian: V7
- Location: Isle of Man
I'm still getting quite a lot of 'Matches' with the same forenames but different Surnames and born in different places.
However the highest score I've got for anyone with 1.7 was 50 and most only score about 20 or lower.
I had one genuine duplicate with identical names, and DOB within 1 year, that only got a score of 13, probably because there was no place info for one of them. Even the parents matched up.
- tatewise
- Megastar
- Posts: 27078
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
John ~ Could you post the details of a few of the highest scoring False Candidates, and of the low scoring Real Duplicates? A pair should score more than 13 if the Individual Names and both Parents match, unless there are also some mismatching Events.
With regard to your Save Result Set Wish List Request, I will add an option to re-display the previous Result Set.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- BillH
- Megastar
- Posts: 2179
- Joined: 31 May 2010 03:40
- Family Historian: V7
- Location: Washington State, USA
Mike,
I'm still getting a lot of non-duplicates on my report. Many of these are high on my report, mixed in with the actual duplicates. Most of these would be eliminated or dropped way down if the mismatched parents' names didn't result in so many points.
You wrote:
"Yes, that is correct. If both Names exist, and their matches result in zero points, then the deduction is applied. You can influence this with values of IntLastNameScore, IntForeNameRight and IntSoundexNames, which can be zero, and IntForeNameWrong that can even be negative."
"Yes OK, I will add separate IntNamesDeductIndi=0, IntNamesDeductFath=-5, IntNamesDeductMoth=-5, IntNamesDeductSpou=0, IntNamesDeductChld=0 in the next Version."
If the new options you are adding work the same way, then will it really help all that much?
Since IntLastNameScore, IntForeNameRight, IntSoundexNames and IntForeNameWrong apply to all name matches, if I set them to zero won't that also impact matching of names for the individuals themselves and for their spouses and children?
I would really like to leave name matching alone for the individuals, their spouses, and children and only impact mothers' and fathers' names if possible.
Ideally, if the mothers' and fathers' surnames are not an exact match, I would deduct so many points that the individuals ended up with no points or negative points.
If the surnames were an exact match, then I would deduct some points if the forenames did not match.
What might work is options something like this:
IntNamesDeductFathSurname
IntNamesDeductFathForenames
IntNamesDeductMothSurname
IntNamesDeductMothForenames
Rather than these working only if the names are a total mismatch, they would work if the names were not a total match. So... we would be able to deduct points if one mother was named Julia Ann Snodgrass and one was named Ann Marie Henshaw for example. The common Ann should not prevent points from being deducted.
Would something like that be possible?
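To show what I mean, here is a rough Python sketch of the idea. It is only an illustration of the proposal, not anything the plugin does today: the four option names are the ones suggested above, the numeric values are made up, and it assumes the surname is the last word of the name.

```python
# Proposed option names from this post; the values are hypothetical.
OPTIONS = {
    "IntNamesDeductFathSurname": -100,
    "IntNamesDeductFathForenames": -10,
    "IntNamesDeductMothSurname": -100,
    "IntNamesDeductMothForenames": -10,
}


def split_name(full_name):
    """Split a name into (forenames, surname), assuming the surname is last."""
    parts = full_name.lower().split()
    return parts[:-1], parts[-1]


def parent_deduction(parent, name_a, name_b):
    """Deduction for one parent pair; parent is 'Fath' or 'Moth'.

    A deduction applies whenever the names are not a total match,
    so a shared forename such as Ann does not prevent it.
    """
    if not name_a.strip() or not name_b.strip():
        return 0  # assumption: nothing to compare if a name is missing
    fore_a, sur_a = split_name(name_a)
    fore_b, sur_b = split_name(name_b)
    if sur_a != sur_b:
        return OPTIONS["IntNamesDeduct" + parent + "Surname"]
    if fore_a != fore_b:
        return OPTIONS["IntNamesDeduct" + parent + "Forenames"]
    return 0
```

With these made-up values, Julia Ann Snodgrass against Ann Marie Henshaw gets the full surname deduction, and Julia Ann Snodgrass against Ann Marie Snodgrass gets only the forenames deduction.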
Thanks,
Bill