How to tweak Find Duplicates?
Posted: 27 Jan 2014 07:40
by Valkrider
Is there a way I can tweak the Find Duplicate plugin to do the following?
I am researching the Lefever / Lefevre surname for a one-name study. The problem that I have is that the spelling of the surname changes depending on the record, so the same person can and does, in a lot of cases, use one or more of Lefever, Le Fever, Lefevre and Le Fevre. Is there a way that I could add a condition to search on all the variations of the surname and treat them as the same, while still respecting the names of spouses and ancestors / descendants? Or am I expecting too much?
Re: How to tweak Find Duplicates?
Posted: 27 Jan 2014 11:35
by tatewise
The Plugin should be able to obtain good matches between such surnames as it stands.
Surnames are de-spaced and converted to UPPERCASE before comparison.
(This avoids matching similar forenames that are converted to lowercase.)
So Lefever and Le Fever both become LEFEVER,
and Lefevre and Le Fevre both become LEFEVRE.
Also both LEFEVER and LEFEVRE have the same Soundex code of L116.
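As a rough illustration of the normalisation and Soundex steps just described, here is a minimal Python sketch. It is not the Plugin's own code; the function names `normalise_surname` and `soundex` are made up for this example, and the Soundex variant shown is the standard American one, which may differ in detail from what the Plugin uses.

```python
import re

# Standard American Soundex digit groups (assumed variant, see note above).
_CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def normalise_surname(name: str) -> str:
    """De-space the surname and convert it to UPPERCASE."""
    return re.sub(r"\s+", "", name).upper()


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name, or '' if it has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Adjacent letters with the same code are only coded once.
        if code and code != prev:
            digits.append(code)
        # H and W do not separate equal codes; vowels do.
        if ch not in "HW":
            prev = code
    return (first + "".join(digits) + "000")[:4]


for raw in ("Lefever", "Le Fever", "Lefevre", "Le Fevre"):
    key = normalise_surname(raw)
    print(raw, "->", key, soundex(key))
```

All four spellings reduce to one of two keys, and both keys share the code L116, which is why a Soundex match can bridge the Lefever / Lefevre difference.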
All Primary Names and Alternate Names are compared, so if each Individual Record has a number of such names then they will match well.
Also the forenames are matched to boost the score, and if a threshold of 6 points is attained, then their key Events and Relatives are also compared.
The Set Preferences tab allows the point scoring to be tweaked.
e.g. the points for a Names Soundex match can be increased, or the Names Threshold reduced.
On the Find Duplicates tab you can choose any subset of Individuals to search.
e.g. Just those with the surname Lefever, Le Fever, Lefevre and Le Fevre.
Have you tried using the Plugin, and if so, do you have any examples of unsatisfactory results that led you to request ways of tweaking it?
Since it is a one name study, I suspect your objective is to get the Plugin to treat everyone with a similar surname as having the "same" surname, then it will be all the other comparisons of Events and Relatives that determine possible duplication.
As long as the Name matching exceeds the threshold, then that objective should be satisfied.
Re: How to tweak Find Duplicates?
Posted: 28 Jan 2014 07:52
by Valkrider
Mike,
Thanks for the full description of how the plugin works. I have been using it since you originally released it. I will sort out some specifics for you, as it works for some and not for others. Unfortunately I did a manual clean-up yesterday before seeing your reply, so it may be a while before I come back to you.
Re: How to tweak Find Duplicates?
Posted: 28 Jan 2014 13:19
by Valkrider
Mike
I have found one pair that has not come up.
Name of both records Lefever, Selina Emily
Record 1
- Birth: 1863 in Shoreditch, London UK
More records including census and parents
Record 2
- Birth: 1863 (approx) in Shoreditch, London UK
No further information
This did not come up as a possible duplicate. I have not altered the Find Duplicate default ratings for lifetime events / names etc.
Any thoughts on tweaks that would help? This would appear to be where my problem lies.
Re: How to tweak Find Duplicates?
Posted: 28 Jan 2014 14:43
by tatewise
I have analysed those two records with the following results.
The Names match to give 7 points for Surname and 2 x 6 points for two forenames = 19 total.
The Births match to give 2 points for overlapping Dates and 2 x 3 points for matching Place Parts = 8 total.
Since no other Events nor Relatives will match, the grand total is 27 points, 9 percent.
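The arithmetic above can be checked with a small Python tally. This is only a sketch: the point values are the ones quoted in this post, while the constant and function names are invented for the example and are not the Plugin's own.

```python
# Point values as quoted in the post (tweakable on the Set Preferences tab).
SURNAME_POINTS = 7
FORENAME_POINTS = 6
DATE_OVERLAP_POINTS = 2
PLACE_PART_POINTS = 3


def name_score(matching_forenames: int) -> int:
    """Points for a surname match plus each matching forename."""
    return SURNAME_POINTS + FORENAME_POINTS * matching_forenames


def birth_score(dates_overlap: bool, matching_place_parts: int) -> int:
    """Points for overlapping Birth dates plus each matching Place part."""
    date_points = DATE_OVERLAP_POINTS if dates_overlap else 0
    return date_points + PLACE_PART_POINTS * matching_place_parts


names = name_score(matching_forenames=2)                       # Selina + Emily
birth = birth_score(dates_overlap=True, matching_place_parts=2)  # Shoreditch + London UK
print(names, birth, names + birth)
```

The percentage is left out on purpose, because the post does not say what maximum the 9 percent is measured against.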
Unless you have lots of higher scoring entries, this pair should appear in the top 100 of the Result Set.
If there are lots of higher scoring entries, then these need to be dealt with first, by merging the genuine duplicates, and adding non-duplicates to the Omit Non-Duplicates tab.
Otherwise, use the Set Preferences tab to increase Results Maximum Rows to say 200 to include lower scoring entries.
Alternatively, there must be some other data that is affecting them:-
1) On the Find Duplicates tab is Include Individuals last Updated from this Date still set to 1 Jan 1900? - If not, and these records have an earlier Updated date, then they will be excluded.
2) Do they both have Sex set to Female? - If not, the score will be reduced by 10 points to 17.
3) Seems unlikely, but are they closely related? - Select both records and use Tools > How Related. - Family members are excluded or have points deducted.
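The three checks above could be pictured roughly as follows. This Python sketch is a guess at the shape of the logic, not the Plugin's implementation. The `Person` class and `adjusted_score` function are hypothetical. It also assumes that a pair is dropped if either record is older than the cut-off date, and that close relatives are simply excluded (the post says they are excluded or have points deducted).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

SEX_MISMATCH_PENALTY = 10  # figure quoted in the post


@dataclass
class Person:
    name: str
    sex: str        # e.g. "Female", "Male" or "" if not recorded
    updated: date   # last Updated date of the record


def adjusted_score(a: Person, b: Person, base_score: int,
                   updated_from: date, closely_related: bool) -> Optional[int]:
    """Return the adjusted score, or None if the pair is excluded."""
    # 1) Records updated before the chosen date are left out (assumed: either one).
    if a.updated < updated_from or b.updated < updated_from:
        return None
    # 3) Close family members are excluded in this sketch.
    if closely_related:
        return None
    # 2) Differing Sex costs points.
    if a.sex != b.sex:
        return base_score - SEX_MISMATCH_PENALTY
    return base_score


r1 = Person("Selina Emily Lefever", "Female", date(2013, 5, 1))
r2 = Person("Selina Emily Lefever", "", date(2012, 3, 1))
print(adjusted_score(r1, r2, 27, date(1900, 1, 1), closely_related=False))
```

With one Sex left blank the 27 points drop to 17, as described in check 2.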
Re: How to tweak Find Duplicates?
Posted: 28 Jan 2014 15:39
by Valkrider
Mike
Thanks for taking a look at this.
Yes, they are both female. The 2 records were created in the last 2 years, so the 1900 date is fine. Tools > How Related shows No Relationship. I have increased the record count to 200 and the list now goes down to a lowest match of 8%, and these 2 records still don't show.
Any other thoughts?
Re: How to tweak Find Duplicates?
Posted: 28 Jan 2014 17:43
by tatewise
I am running out of ideas.
Can you confirm that Record 2 mentioned above has only a Name and Birth Event, with no other Facts (Events/Attributes) and no Relatives (Father, Mother, Husband, Children)?
Double check that the Name and Date & Place of Birth are identical except that one Date is approximate.
Run the Plugin and tick Enable Diagnostic Mode at the bottom.
Click the Include any Selected subset of the Individuals button and select just those two Selina Emily Lefever records.
Now click the Find any Duplicates constrained by ... button, and those records will almost certainly be listed in the Result Set.
What points are shown under which headings?
Re: How to tweak Find Duplicates?
Posted: 28 Jan 2014 20:05
by Jane
Colin, if you export the two ladies with their close relations to another file, do they match then?
Re: How to tweak Find Duplicates?
Posted: 28 Jan 2014 21:02
by Valkrider
Mike
They are identical for the name and place. The date is 1863 for one and 1863(app) for the other in the properties window.
The points output is attached. All the remaining points boxes are zero.

Attachment: two.JPG (24.18 KiB)
Re: How to tweak Find Duplicates?
Posted: 28 Jan 2014 22:36
by tatewise
The Name match scores 19 points as I predicted.
The Birth match only scores 5 points, which tells me that the Place fields are NOT identical.
Since one Date is approximate, and the other is not, and they only define the Year, they score 2 points for an overlapping date range, because for an approximate Year the Plugin extends the range by +/-5 years.
Thus the Place parts score only 3 points, which means only one of the two parts Shoreditch and London UK actually match up.
So I suspect either Shoreditch or London UK is missing from one of the Birth Events or spelt significantly differently.
e.g. Shoreditch in one and Shoredich in the other, which not only don't match, but have different Soundex codes.
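To make the 5-point Birth score easier to follow, here is a small Python sketch of the two comparisons described above. It is an illustration under assumptions, not the Plugin's code. The names are invented, Place parts are taken to be the comma-separated pieces of the Place field, and only exact (case-insensitive) part matches are counted, so any Soundex fallback is left out.

```python
APPROX_YEAR_MARGIN = 5     # +/- years for an approximate Year, as stated in the post
DATE_OVERLAP_POINTS = 2
PLACE_PART_POINTS = 3


def year_range(year, approximate):
    """Return the (first, last) year a year-only date could cover."""
    margin = APPROX_YEAR_MARGIN if approximate else 0
    return year - margin, year + margin


def ranges_overlap(a, b):
    """True if two inclusive (first, last) ranges share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


def place_parts(place):
    """Split a Place on commas into a set of trimmed, uppercased parts."""
    return {part.strip().upper() for part in place.split(",") if part.strip()}


def birth_score(year_a, approx_a, place_a, year_b, approx_b, place_b):
    """Date-overlap points plus points for each Place part found in both."""
    points = 0
    if ranges_overlap(year_range(year_a, approx_a), year_range(year_b, approx_b)):
        points += DATE_OVERLAP_POINTS
    points += PLACE_PART_POINTS * len(place_parts(place_a) & place_parts(place_b))
    return points


print(birth_score(1863, False, "Shoreditch, London UK",
                  1863, True, "Shoreditch, London UK"))
print(birth_score(1863, False, "Shoreditch, London UK",
                  1863, True, "Shoredich, London UK"))
```

With identical Places the sketch gives 8 points. With one part misspelt it gives 5, which is the figure in the attached diagnostic output.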
The message to take away is to use Tools > Work with Data > Places to merge similar but different entries.
This pair would eventually rise to the top of the Result Set once the higher scoring pairs have been analysed, and either merged as duplicates, or excluded as non-duplicates.
I know that to the human eye they look obvious duplicates, but the Plugin has little to go on when only the Name is a good match, and there is only one Event with similar parameters.
If the Plugin gives too much weight to such tenuous matches it tends to report too many false positives.
Re: How to tweak Find Duplicates?
Posted: 29 Jan 2014 15:23
by Valkrider
Mike
Thanks very much for sticking with me and this issue.
Your prompt about the address has caused me to run the Places clean-up and I have tidied up all my places (took a long while) but even still these two aren't showing.
Can you please remind me how to exclude non-duplicates that have been reviewed? I seem to have forgotten and couldn't find it in the plugin instructions or in the knowledgebase.
Re: How to tweak Find Duplicates?
Posted: 29 Jan 2014 15:28
by tatewise
Try the Omit Non-Duplicates tab and Help & Advice button!
That help is also in the KB under FH V5 Plugins > Plugin Help & Advice.
See > Find Duplicate Individuals (KB page plugins:help:find_duplicate_individuals:find_duplicate_individuals).
Also if you enter Find Duplicate Individuals in the KB Search box the Results lead you to relevant pages.
Did the two ladies have differing Place names or not?
As I said, while your 100th Result Set entry scores more than 24/27 points (8/9%), they will not be listed.
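The cut-off effect can be sketched in a few lines of Python. This is a simplified picture that assumes the Result Set is just the highest-scoring pairs up to Results Maximum Rows; the function name and the sample data are invented for illustration.

```python
def result_set(scored_pairs, max_rows=100):
    """Return the top max_rows (label, points) pairs, highest score first."""
    ranked = sorted(scored_pairs, key=lambda pair: pair[1], reverse=True)
    return ranked[:max_rows]


# 150 made-up pairs scoring 30 points each, plus the pair of interest at 27.
pairs = [(f"pair {n}", 30) for n in range(150)]
pairs.append(("Selina Emily Lefever x 2", 27))

top_100 = result_set(pairs, max_rows=100)
top_200 = result_set(pairs, max_rows=200)
print(any(label.startswith("Selina") for label, _ in top_100))
print(any(label.startswith("Selina") for label, _ in top_200))
```

While enough higher-scoring pairs remain, the 27-point pair stays below the cut-off. Merging or omitting those pairs, or raising Results Maximum Rows, brings it into view.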
Re: How to tweak Find Duplicates?
Posted: 29 Jan 2014 16:58
by Valkrider
Is there no way to select them all? Is it literally one at a time?!
Re: How to tweak Find Duplicates?
Posted: 29 Jan 2014 17:02
by tatewise
Yes, one at a time, because I assumed users would only be reviewing them a few at a time, and merging or excluding as they went along.
Re: How to tweak Find Duplicates?
Posted: 29 Jan 2014 17:08
by Valkrider
I have 200 to do, so this is going to take an age, particularly as it takes 2 clicks per entry.
Re: How to tweak Find Duplicates?
Posted: 29 Jan 2014 18:48
by johnmorrisoniom
I am with Valkrider on this one. I too would like the ability to move multiple choices (or all) onto the Omit Non-Duplicates list, or a drag and drop option!
I tend to check the whole Result Set first and do all the merges that result. Then I want to move all the remaining entries to the Omit list.
Re: How to tweak Find Duplicates?
Posted: 29 Jan 2014 19:18
by tatewise
OK, I give in, I'll have a look at that.
Re: How to tweak Find Duplicates?
Posted: 02 Apr 2014 10:15
by tatewise
Multiple selection of entries on the Omit Non-Duplicates tab is now incorporated into V3.4 in the Plugin Store.
BTW: I never got an answer to my question:
Did the two ladies have differing Place names or not?
Re: How to tweak Find Duplicates?
Posted: 02 Apr 2014 11:08
by Valkrider
Mike
Thanks for the updated plugin. Downloading it now.
Sorry I did not answer your question specifically. If they were different, the 'Tidy Places' that I did probably sorted it. Anyway, in the end I merged them as they were duplicates. Sorry that I ran the update before I answered your question.