* How to tweak Find Duplicates?
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
How to tweak Find Duplicates?
Is there a way I can tweak the Find Duplicate plugin to do the following?
I am researching the Lefever / Lefevre surname for a one-name study. The problem is that the spelling of the surname changes from record to record, so the same person can, and in many cases does, appear under one or more of Lefever, Le Fever, Lefevre and Le Fevre. Is there a way to add a condition that searches on all the variations of the surname and treats them as the same, while still respecting the names of spouses, ancestors and descendants? Or am I expecting too much?
- tatewise
- Megastar
- Posts: 27087
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: How to tweak Find Duplicates?
The Plugin should be able to obtain good matches between such surnames as it stands.
Surnames are de-spaced and converted to UPPERCASE before comparison.
(This avoids matching similar forenames that are converted to lowercase.)
So Lefever and Le Fever both become LEFEVER,
and Lefevre and Le Fevre both become LEFEVRE.
Also both LEFEVER and LEFEVRE have the same Soundex code of L116.
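The normalisation and Soundex comparison described above can be sketched as follows. This is only an illustration in Python, not the Plugin's own code (Family Historian plugins are written in Lua); the function names `normalise_surname` and `soundex` are hypothetical, and the Soundex variant shown is the standard American one, which is an assumption that happens to reproduce the L116 code quoted above.

```python
# Illustrative sketch only: these helpers are hypothetical, not taken from the Plugin.

def normalise_surname(surname: str) -> str:
    """Remove all spaces and convert to upper case, e.g. 'Le Fever' -> 'LEFEVER'."""
    return "".join(surname.split()).upper()


# Letter groups used by standard American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Return the first letter followed by three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        digit = _CODES.get(c, "")  # vowels have no code and reset 'previous'
        if digit and digit != previous:
            result += digit
        previous = digit
    return (result + "000")[:4]


for spelling in ("Lefever", "Le Fever", "Lefevre", "Le Fevre"):
    key = normalise_surname(spelling)
    print(spelling, "->", key, soundex(key))
```

All four spellings reduce to one of two keys, and both keys share a single Soundex code, which is why the surname variants are expected to match without any special configuration.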
All Primary Names and Alternate Names are compared, so if each Individual Record has a number of such names then they will match well.
Also the forenames are matched to boost the score, and if a threshold of 6 points is attained, then their key Events and Relatives are also compared.
The Set Preferences tab allows the point scoring to be tweaked.
e.g. The points for a Names Soundex match increased, or the Names Threshold reduced.
On the Find Duplicates tab you can choose any subset of Individuals to search.
e.g. Just those with the surname Lefever, Le Fever, Lefevre and Le Fevre.
Have you tried using the Plugin, and if so, do you have any examples of unsatisfactory results that led you to ask for ways of tweaking it?
Since it is a one-name study, I suspect your objective is to get the Plugin to treat everyone with a similar surname as having the "same" surname, so that all the other comparisons of Events and Relatives determine possible duplication.
As long as the Name matching exceeds the threshold, then that objective should be satisfied.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: How to tweak Find Duplicates?
Mike,
Thanks for the full description of how the plugin works. I have been using it since you originally released it. I will sort out some specifics for you, as it works for some records and not for others. Unfortunately I did a manual clean-up yesterday before seeing your reply, so it may be a while before I come back to you.
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: How to tweak Find Duplicates?
Mike
I have found one that has not come up
Name of both records Lefever, Selina Emily
Record 1
- Birth: 1863 in Shoreditch, London UK
Record 2
- Birth: 1863 (approx) in Shoreditch, London UK
This pair did not come up as a possible duplicate. I have not altered the Find Duplicates default ratings for lifetime events, names, etc.
Any thoughts on tweaks that would help? This would appear to be where my problem lies.
- tatewise
- Megastar
- Posts: 27087
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: How to tweak Find Duplicates?
I have analysed those two records with the following results.
The Names match to give 7 points for Surname and 2 x 6 points for two forenames = 19 total.
The Births match to give 2 points for overlapping Dates and 2 x 3 points for matching Place Parts = 8 total.
Since no other Events nor Relatives will match, the grand total is 27 points, 9 percent.
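The arithmetic above can be written out as a small worked example. This is a sketch in Python, not the Plugin's code: the constant and function names are hypothetical, and the point values are simply the defaults quoted in this post (they can be changed on the Set Preferences tab). The percentage is left out because the scale it is measured against is not stated in this thread.

```python
# Illustrative sketch only: names are hypothetical, point values are those quoted above.
SURNAME_POINTS = 7        # matching surname
FORENAME_POINTS = 6       # each matching forename
DATE_OVERLAP_POINTS = 2   # overlapping date ranges
PLACE_PART_POINTS = 3     # each matching Place part


def name_score(forenames_matched: int) -> int:
    """Points for a surname match plus the given number of forename matches."""
    return SURNAME_POINTS + forenames_matched * FORENAME_POINTS


def birth_score(dates_overlap: bool, place_parts_matched: int) -> int:
    """Points for one Birth Event comparison."""
    points = DATE_OVERLAP_POINTS if dates_overlap else 0
    return points + place_parts_matched * PLACE_PART_POINTS


names = name_score(forenames_matched=2)                           # Selina, Emily
birth = birth_score(dates_overlap=True, place_parts_matched=2)    # Shoreditch / London UK
print(names, birth, names + birth)
```

With two forenames and two Place parts matching, the sketch gives 19 points for the Names, 8 for the Births and 27 in total, as predicted above.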
Unless you have lots of higher scoring entries, this pair should appear in the top 100 of the Result Set.
If there are lots of higher scoring entries, then these need to be dealt with first, by merging the genuine duplicates, and adding non-duplicates to the Omit Non-Duplicates tab.
Otherwise, use the Set Preferences tab to increase Results Maximum Rows to say 200 to include lower scoring entries.
Alternatively, there must be some other data that is affecting them:-
1) On the Find Duplicates tab is Include Individuals last Updated from this Date still set to 1 Jan 1900? - If not, and these records have an earlier Updated date, then they will be excluded.
2) Do they both have Sex set to Female? - If not, the score will be reduced by 10 points to 17.
3) Seems unlikely, but are they closely related? - Select both records and use Tools > How Related. - Family members are excluded or have points deducted.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: How to tweak Find Duplicates?
Mike
Thanks for taking a look at this.
Yes they are both female. The 2 records were created in the last 2 years so the 1900 date is fine. Tools > How related shows No Relationship. I have increased the record count to 200 and the list now goes down to a lowest match of 8% and these 2 records still don't show.
Any other thoughts?
- tatewise
- Megastar
- Posts: 27087
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: How to tweak Find Duplicates?
I am running out of ideas.
Can you confirm that Record 2 mentioned above has only a Name and a Birth Event, with no other Facts (Events/Attributes) and no Relatives (Father, Mother, Husband, Children)?
Double check that the Name and Date & Place of Birth are identical except that one Date is approximate.
Run the Plugin and tick Enable Diagnostic Mode at the bottom.
Click the Include any Selected subset of the Individuals button and select just those two Selina Emily Lefever records.
Now click the Find any Duplicates constrained by ... button, and those records will almost certainly be listed in the Result Set.
What points are shown under which headings?
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- Jane
- Site Admin
- Posts: 8442
- Joined: 01 Nov 2002 15:00
- Family Historian: V7
- Location: Somerset, England
- Contact:
Re: How to tweak Find Duplicates?
Colin, if you export the two ladies with their close relations to another file, do they match then?
Jane
My Family History : My Photography "Knowledge is knowing that a tomato is a fruit. Wisdom is not putting it in a fruit salad."
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: How to tweak Find Duplicates?
Mike
They are identical for the name and place. In the properties window the date is 1863 for one and 1863 (app) for the other.
The points output is attached. All the remaining points boxes are zero.
- tatewise
- Megastar
- Posts: 27087
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: How to tweak Find Duplicates?
The Name match scores 19 points as I predicted.
The Birth match only scores 5 points, which tells me that the Place fields are NOT identical.
Since one Date is approximate, and the other is not, and they only define the Year, they score 2 points for an overlapping date range, because for an approximate Year the Plugin extends the range by +/-5 years.
Thus the Place parts score only 3 points, which means only one of the two parts Shoreditch and London UK actually match up.
So I suspect either Shoreditch or London UK is missing from one of the Birth Events or spelt significantly differently.
e.g. Shoreditch in one and Shoredich in the other, which not only don't match, but have different Soundex codes.
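The reasoning above (2 points for the overlapping dates, so only 3 points left for a single matching Place part) can be sketched like this. It is a Python illustration under stated assumptions, not the Plugin's code: the function names are hypothetical, Place parts are assumed to be the comma-separated pieces of the Place field compared position by position, and the comparison is a plain case-insensitive equality test rather than the Soundex-assisted match that the Plugin evidently uses.

```python
# Illustrative sketch only: hypothetical helpers and simplified matching rules.
APPROX_YEARS = 5  # an approximate Year is widened by +/-5 years, as described above


def year_range(year: int, approximate: bool) -> tuple:
    """Return the (earliest, latest) year covered by a Year-only date."""
    margin = APPROX_YEARS if approximate else 0
    return (year - margin, year + margin)


def ranges_overlap(a: tuple, b: tuple) -> bool:
    """True when the two inclusive ranges share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


def place_parts(place: str) -> list:
    """Split 'Shoreditch, London UK' into ['SHOREDITCH', 'LONDON UK']."""
    return [part.strip().upper() for part in place.split(",") if part.strip()]


def birth_points(year_a, approx_a, place_a, year_b, approx_b, place_b) -> int:
    """2 points for overlapping dates plus 3 points per matching Place part."""
    points = 0
    if ranges_overlap(year_range(year_a, approx_a), year_range(year_b, approx_b)):
        points += 2
    for part_a, part_b in zip(place_parts(place_a), place_parts(place_b)):
        if part_a == part_b:
            points += 3
    return points


same = birth_points(1863, False, "Shoreditch, London UK",
                    1863, True, "Shoreditch, London UK")
typo = birth_points(1863, False, "Shoreditch, London UK",
                    1863, True, "Shoredich, London UK")
print(same, typo)
```

With identical Places the sketch gives 8 points; with the hypothetical Shoredich misspelling in one record it gives 5, which is the figure seen in the diagnostic output.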
The message to take away is to use Tools > Work with Data > Places to merge similar but different entries.
This pair would eventually rise to the top of the Result Set once the higher scoring pairs have been analysed, and either merged as duplicates, or excluded as non-duplicates.
I know that to the human eye they look obvious duplicates, but the Plugin has little to go on when only the Name is a good match, and there is only one Event with similar parameters.
If the Plugin gives too much weight to such tenuous matches it tends to report too many false positives.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: How to tweak Find Duplicates?
Mike
Thanks very much for sticking with me and this issue.
Your prompt about the address caused me to run the Places clean-up, and I have tidied up all my places (it took a long while), but even so these two still aren't showing.
Can you please remind me how to exclude non duplicates that have been reviewed? I seem to have forgotten and couldn't find it in the plugin instructions or in the knowledgebase.
- tatewise
- Megastar
- Posts: 27087
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: How to tweak Find Duplicates?
Try the Omit Non-Duplicates tab and Help & Advice button!
That help is also in the KB under FH V5 Plugins > Plugin Help & Advice.
See the Find Duplicate Individuals page (plugins:help:find_duplicate_individuals:find_duplicate_individuals).
Also if you enter Find Duplicate Individuals in the KB Search box the Results lead you to relevant pages.
Did the two ladies have differing Place names or not?
As I said, while your 100th Result Set entry scores more than 24 or 27 points (8% or 9%), they will not be listed.
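The cut-off effect can be sketched as follows. This is a Python illustration, not the Plugin's code: the function name `result_set` and the sample scores are hypothetical, and it assumes the Result Set is simply the highest-scoring pairs up to the Results Maximum Rows setting.

```python
# Illustrative sketch only: hypothetical function and made-up sample scores.

def result_set(scored_pairs, max_rows=100):
    """Return the highest-scoring (label, points) pairs, limited to max_rows."""
    ranked = sorted(scored_pairs, key=lambda pair: pair[1], reverse=True)
    return ranked[:max_rows]


# 150 hypothetical pairs that all outscore the pair of interest.
pairs = [("pair %d" % i, 30) for i in range(150)]
selina = ("Selina Emily Lefever", 24)
pairs.append(selina)

print(selina in result_set(pairs, max_rows=100))
print(selina in result_set(pairs, max_rows=200))
```

In this made-up data the pair is hidden at 100 rows and appears at 200 rows. Merging or omitting the higher-scoring pairs has the same effect as raising the row limit, because it frees up rows for lower scores.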
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: How to tweak Find Duplicates?
Is there no way to select them all? Is it literally one at a time?
- tatewise
- Megastar
- Posts: 27087
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: How to tweak Find Duplicates?
Yes, one at a time, because I assumed users would only be reviewing them a few at a time, and merging or excluding as they went along.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: How to tweak Find Duplicates?
I have 200 to do, so this is going to take an age,
particularly as it takes 2 clicks per entry.
- johnmorrisoniom
- Megastar
- Posts: 882
- Joined: 18 Dec 2008 07:40
- Family Historian: V7
- Location: Isle of Man
Re: How to tweak Find Duplicates?
I am with Valkrider on this one. I too would like the ability to move multiple choices (or all of them) onto the Omit Non-Duplicates list, or a drag and drop option!
I tend to check all the result set first, do all merges that result. Then I want to move all remaining to the Omit list.
- tatewise
- Megastar
- Posts: 27087
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: How to tweak Find Duplicates?
OK, I give in, I'll have a look at that 
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- tatewise
- Megastar
- Posts: 27087
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: How to tweak Find Duplicates?
Multiple selection of entries on the Omit Non-Duplicates tab is now incorporated into V3.4 in the Plugin Store.
BTW: I never got an answer to my question:
Did the two ladies have differing Place names or not?
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- Valkrider
- Megastar
- Posts: 1534
- Joined: 04 Jun 2012 19:03
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: How to tweak Find Duplicates?
Mike
Thanks for the updated plugin. Downloading it now.
Sorry I did not answer your questions specifically. If the Places were different, the 'Tidy Places' that I did probably sorted it. Anyway, in the end I merged the two records as they were duplicates. Sorry that I ran the update before I answered your question.