* Find Duplicate Individuals Version 2.3+

tatewise
Megastar
Posts: 27079
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Post by tatewise » 13 Oct 2012 10:00

Can the users who have been trialling this WiP Plugin please review this latest version.

The Find Duplicate Individuals Version 2.3 is now available for download.

Sorry for the delay but vacations, visiting relations, school holidays, and the Olympics got in the way.

The main change is that all settings are now 'sticky' via the Set Preferences tab.

If you have edited the V2.2 User Preference Settings at the start of the Lua script, then make a note of them before installing V2.3, and enter them into the new V2.3 Set Preferences tabs. They will then be saved in the Project ...\Plugin Data\Find Duplicate Individuals.dat file, which can be copied from Project to Project to transfer the settings.

A few extra Date Chronology checks have been added as suggested by John for Father and Daughters.

When comparing Place Name parts, any spaces and upper/lower case differences are disregarded, in the same way as for Individual Names.
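
For anyone curious how that kind of comparison works, here is a rough Python sketch. It is only an illustration and not the Plugin's actual Lua code; the function names are made up, and splitting a Place on commas is an assumption about how the parts are obtained.

```python
def normalise_part(part: str) -> str:
    """Remove all whitespace and fold case, so 'New  York' becomes 'newyork'."""
    return "".join(part.split()).casefold()


def place_parts_match(a: str, b: str) -> bool:
    """Compare two Place Name parts, ignoring spaces and upper/lower case."""
    return normalise_part(a) == normalise_part(b)


def split_place(place: str) -> list[str]:
    """Split a Place into parts (assumes the parts are comma-separated)."""
    return [part.strip() for part in place.split(",")]
```

With this, `place_parts_match("New York", "NEWYORK")` is true, while parts that differ in their letters, such as Devon and Dorset, still count as different.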

If this version passes muster, then I will create structured Help & Advice pages and publish V3.0 in the Plugin Store.


Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

BillH
Megastar
Posts: 2179
Joined: 31 May 2010 03:40
Family Historian: V7
Location: Washington State, USA

Post by BillH » 15 Oct 2012 14:09

Mike,

Seems to be working great. Thanks!

One small thing. I had IntFathLastWrong and IntMothLastWrong set to -20. This doesn't seem to be possible in version 2.3. Looks like the max is now -10. Was this an intentional change?

Bill

Post by tatewise » 15 Oct 2012 14:54

I could not remember how negative that setting needed to be.
I think I will adjust all the negative setting limits to be consistently -100.

Post by BillH » 15 Oct 2012 16:26

Sounds great.

Thanks,

Bill

Post by johncommins » 18 Oct 2012 12:11

I record parents who have not been found as not found 'SURNAME whatever' for males and not found marr 'SURNAME whatever' for females. On running the duplicates Plugin,
it lists them as pairs, each matched with a completely different person on the other side. No other names seem to be reported as duplicates, only this not found/not found marr list.
John

User avatar
tatewise
Megastar
Posts: 27079
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK
Contact:

Find Duplicate Individuals Version 2.3+

Post by tatewise » 18 Oct 2012 13:44

Thank you for your feedback.
I think you are saying you have many Individuals with Forenames not found or not found marr and with the Surname of their children.

The matching is caused by the multiple matching Forenames, not only for the Individual, but also for their spouse, and despite a Gender mismatch.
However, these will all be quite low scoring matches.

There are several solutions:

1) Remove the spaces from those Forenames so they become NotFound or NotFoundMarr. This reduces the Forenames score below the Threshold for further matching.

2) Adjust the Plugin Set Preferences > Names Matching > Last Wrong to -1. This inhibits matching where the Surname differs, and will be the default in the next version of the Plugin.

3) Use the Plugin Omit Non-Duplicates tab, and move all the pairs into the exclusion list.

Post by johncommins » 18 Oct 2012 16:25

Mike, I have tried the first suggestion and now get the same result, but in addition any person who has more than one Christian name is also listed.
John

johnmorrisoniom
Megastar
Posts: 882
Joined: 18 Dec 2008 07:40
Family Historian: V7
Location: Isle of Man

Post by johnmorrisoniom » 18 Oct 2012 17:18

Mike, any chance of having an option to move multiple entries to the non-duplicates list? It is quite time consuming on a large file to have to move them one at a time.
I have found it easier to check several (about a page full) first, before moving the non-matches, as the plugin has to be opened and closed each time.
I still get quite an overload of pairs with two matching Christian names and non-matching surnames, but all the scores are quite low anyway.
Still a good plugin though.

Post by tatewise » 18 Oct 2012 17:53

Could you both please try option 2) Adjust the Plugin Set Preferences > Names Matching > Last Wrong all to -1, which will be the default in the next version.

Post by tatewise » 18 Oct 2012 21:38

There are also other preference settings to tinker with.

Set Preferences > User Interface > Minimum Score can be increased a little to cut off low scores.

Set Preferences > Date Chronology > Chron Magnitude can be reduced.
Set Preferences > Date Chronology > Chron Tolerance can be increased.
These changes make Date mismatches more likely to exclude or lower the score of candidate duplicates.

Once all the candidate duplicates have been analysed and either merged or discounted, then on the Find Duplicates tab use Set the Updated from Date to this last run Date.
Thereafter, only Individual Records modified after this Updated from Date will be checked against all other Individual Records.
Thus all the previously checked candidate duplicates will be omitted unless a new match is found.
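
As a rough illustration of why this cuts the work down, here is a small Python sketch of the idea. It is not the Plugin's Lua code; the function name and the record layout (a dictionary of record id to last-modified date) are invented for the example, and whether a record changed on the Updated from Date itself counts is an assumption.

```python
from datetime import date
from typing import Optional


def pairs_to_check(last_modified: dict[str, date],
                   updated_from: Optional[date]) -> list[tuple[str, str]]:
    """Return the pairs of record ids that still need comparing.

    With no Updated from Date every pair is compared. Otherwise a pair is
    only compared if at least one of its records changed after that date
    (strictly after, which is an assumption in this sketch).
    """
    ids = sorted(last_modified)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if (updated_from is None
                    or last_modified[a] > updated_from
                    or last_modified[b] > updated_from):
                pairs.append((a, b))
    return pairs
```

So with three records of which only one has changed since the last run, two pairs are compared instead of three, and the saving grows quickly on a large file.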

Post by tatewise » 21 Oct 2012 01:20

Further ideas...

Forename Duplicates
The name matching ignores punctuation characters, and names less than 3 characters long.
So change not found or NotFound to ? or ~ and it will be ignored.
To differentiate between male and female use Mr and Ms or similar 2-letter names, that are ignored.
With these changes the forenames will not match, and such individuals will not be listed as duplicates.
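
To show the effect of those two rules, here is a hedged Python sketch. It is only an approximation of the behaviour described above, not the Plugin's Lua code; the function name is made up, and treating punctuation as simply deleted (rather than as a word break) is an assumption.

```python
import re

MIN_NAME_LENGTH = 3  # names shorter than this are ignored, as described above


def significant_names(forenames: str) -> list[str]:
    """Return the forenames that would take part in matching."""
    # Assumption: punctuation is removed outright before the names are split.
    cleaned = re.sub(r"[^\w\s]", "", forenames)
    return [name.casefold() for name in cleaned.split()
            if len(name) >= MIN_NAME_LENGTH]
```

Under these assumptions `not found` yields two names to match on and `NotFound` yields one, whereas `?`, `~`, `Mr` and `Ms` yield none at all.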

Omit Non-Duplicates
It is difficult to provide multiple selection of candidate pairs to omit.
In any case I think it would only save a few clicks.

What I have done in the next version is avoid having to rerun the Plugin Find any Duplicates... option when managing the Result Set of candidates.

The technique is:
1) Run the Plugin once on the whole database using the Find any Duplicates... button.
2) Work through the Result Set and identify some candidate pairs to Merge or mark as Non-Duplicates.
3) Having Merged some, then open the Plugin again and use the Omit Non-Duplicates tab as necessary.
4) Now the new feature is that the Show previous Result Set... button will omit the Merged pairs and the Non-Duplicates from the Result Set.
Steps 2) to 4) can be repeated as and when desired, without suffering the time penalty of using the Find any Duplicates... button.
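
Step 4) amounts to filtering the saved Result Set rather than recalculating it. A minimal Python sketch of that idea follows; it is not the Plugin's Lua code, and the names and the representation of a pair (a frozenset of two record ids) are purely illustrative.

```python
def remaining_candidates(result_set, merged_pairs, non_duplicates):
    """Filter a saved Result Set without rerunning any comparisons.

    result_set     : list of (id_a, id_b, score) tuples from the last full run
    merged_pairs   : set of frozenset({id_a, id_b}) already merged
    non_duplicates : set of frozenset({id_a, id_b}) marked as not duplicates
    """
    handled = merged_pairs | non_duplicates
    return [(a, b, score) for a, b, score in result_set
            if frozenset((a, b)) not in handled]
```

Because only set lookups are involved, this takes seconds even when the original run took over an hour.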

When complete, use Set the Updated from Date to this last run Date.
Then only updated Individuals will be checked in future, reducing run time considerably.

Jane
Site Admin
Posts: 8441
Joined: 01 Nov 2002 15:00
Family Historian: V7
Location: Somerset, England

Post by Jane » 21 Oct 2012 09:18

One extra option I wondered about was using the Behind the Name API to look for forename variants when matching, eg Nellie for Helen.

http://www.behindthename.com/api/
Jane
My Family History : My Photography "Knowledge is knowing that a tomato is a fruit. Wisdom is not putting it in a fruit salad."

Post by tatewise » 21 Oct 2012 12:05

My first reaction is that the Plugin would almost immediately exceed most of the usage limits;
especially on large databases, assuming it used the Related Names API for all Forenames of all Individuals.
Although the documentation does say Usage limits may be increased upon request.

To use the API, it needs a registered account to get an API key.
If the Plugin only has one API key, then multiple users could also soon exceed the usage limits;
especially the 200000 requests per year.
Alternatively, every user would have to register manually, because a captcha prevents Plugin auto-registration.

Also the documentation says the Related Names API is only available by request.

Jane, had you any particular thoughts on how the Plugin might use the Related Names API?

Post by johnmorrisoniom » 21 Oct 2012 15:58

Hi Mike ,
Your new thoughts about managing the omit duplicates feature sound good.
Another feature request, if possible:
Is it possible to put an extra 'Are you Sure' stage in the 'Erase entire non duplicates list' action?
One wrong click and I accidentally deleted my list.

By the way, the time estimate for my last run was good (estimate said 17 to 69 minutes, actual time was 71 mins then 5 minute wait for result set for 31960 records)

The highest score I am getting is a total of 38, and in that set I had genuine matches at 35, 34, 33 & 32 points.

Post by Jane » 21 Oct 2012 16:18

Mike, I'd have to re-read the T&Cs, but one thought would be to build a 'cache' on FHUG to parse the requests, so only new names were passed on to the main site. If you think it's worth exploring I don't mind building the cache script.
Jane

Post by tatewise » 21 Oct 2012 16:57

Jane, I think I'll put that feature in my potential things to do list for the time being, and focus on getting the current version stable enough to publish in the Plugin Store.

It does sound a useful extension, but assuming it does not violate the API Terms & Conditions, getting the FHUG cache populated initially poses some issues.
It would take a while to populate at only 1 request per second and 1000 requests per hour, presumably by using a Plugin with a list of popular forenames, throttled back to run slowly.
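
To put a rough number on "a while", here is a small Python calculation using only the limits quoted in this thread (1 per second, 1000 per hour, 200000 per year). The function names are made up for the example, and the 5000-name list mentioned afterwards is a hypothetical figure, not a measured one.

```python
import math

REQUESTS_PER_SECOND = 1
REQUESTS_PER_HOUR = 1000
REQUESTS_PER_YEAR = 200_000


def hours_to_populate(name_count: int) -> int:
    """Whole hours needed to look up name_count forenames, one request each."""
    # The hourly cap is the tighter limit: 1 per second would allow 3600 an hour.
    per_hour = min(REQUESTS_PER_HOUR, REQUESTS_PER_SECOND * 3600)
    return math.ceil(name_count / per_hour)


def within_yearly_limit(name_count: int) -> bool:
    """True if the lookups fit inside the yearly request allowance."""
    return name_count <= REQUESTS_PER_YEAR
```

For a hypothetical list of 5000 popular forenames, `hours_to_populate(5000)` gives 5 hours, which is well inside the yearly allowance.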

Post by Jane » 21 Oct 2012 17:12

I agree it might take a while, but as long as we have a local and an FHUG cache it should be OK. I'll try and take a look at getting access at some point.
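
In outline, the local plus FHUG cache idea could work something like the Python sketch below. Everything here is hypothetical: both caches are plain dictionaries standing in for a local file and an FHUG script, and `fetch_from_api` is a placeholder for whatever call the real API would need, not its actual interface.

```python
from typing import Callable


def make_variant_lookup(local_cache: dict[str, list[str]],
                        shared_cache: dict[str, list[str]],
                        fetch_from_api: Callable[[str], list[str]]):
    """Build a lookup that tries the local cache, then the shared cache,
    and only falls back to the remote API for names seen nowhere before."""

    def lookup(name: str) -> list[str]:
        key = name.casefold()
        if key in local_cache:
            return local_cache[key]
        if key in shared_cache:
            variants = shared_cache[key]
        else:
            variants = fetch_from_api(key)  # the only step that uses up quota
            shared_cache[key] = variants
        local_cache[key] = variants
        return variants

    return lookup
```

Each distinct forename would then cost at most one API request across all users, however many times it turns up in their files.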
Jane

Post by tatewise » 22 Oct 2012 01:04

John Morrison said:
"actual time was 71 mins then 5 minute wait for result set for 31960 records"

John, how big is your Result Set? 5 minutes seems a long time to display a Result Set with the default maximum of 100 candidate pairs.

Post by johnmorrisoniom » 22 Oct 2012 09:09

Hi Mike,
Result set is exactly 100 pairs, top score is 38, lowest score is 30.
Highest individual score was 33 (A non match).
Lowest Individual score was 3 (But 18 for Father and 11 for Mother -1 for Chrono total 31) (A Non Match)
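
The figures above are consistent with the total simply being the sum of the parts (3 + 18 + 11 - 1 = 31). A tiny Python sketch of that arithmetic follows; that the Plugin adds the components exactly this way is an assumption drawn from these numbers only, and the class and field names are invented.

```python
from dataclasses import dataclass


@dataclass
class PairScore:
    individual: int   # score for the Individual's own details
    father: int       # score for matching Father details
    mother: int       # score for matching Mother details
    chronology: int   # Date Chronology adjustment, may be negative

    @property
    def total(self) -> int:
        """Total score, assumed here to be a plain sum of the components."""
        return self.individual + self.father + self.mother + self.chronology


def passes_minimum(score: PairScore, minimum_score: int) -> bool:
    """True if the pair would survive a Minimum Score cut-off."""
    return score.total >= minimum_score
```

It shows how a pair with a weak Individual score can still reach the Result Set when both parents match strongly.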

I will run it again tonight (because I've lost my omit duplicates list), and I'll try to get an accurate time from the disappearance of the progress bar to the appearance of the result set.
It did seem like quite a long time (maybe 5 minutes was an overestimate).

Post by tatewise » 22 Oct 2012 10:47

John, before doing that, use the Show previous Result Set of Duplicates in Family Historian button at the bottom, to see how long that takes to display.
I think it should only be a matter of seconds rather than minutes.

If you use Windows 7, there may be a quick way of getting your Omit Non-Duplicates List back from a previous version.
Use Windows Explorer and navigate to the Plugin Data folder at:
C:\Users\{user}\Documents\Family Historian Projects\{project}\{project}.fh_data\Plugin Data
Right-click on the Find Duplicate Individuals.nondups file and select Properties from the menu.
Select the Previous Versions tab and wait for the list to appear.
Choose a version from the list and click the Restore button.
Job done!
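
If it helps to see how that folder path is put together, here is a small Python sketch. It simply follows the pattern quoted above for a default Windows 7 layout; the function names are made up, and the user and project names in the note below are placeholders.

```python
from pathlib import PureWindowsPath


def plugin_data_dir(user: str, project: str) -> PureWindowsPath:
    """Build the Plugin Data folder path for a given Windows user and Project."""
    return (PureWindowsPath("C:/Users") / user / "Documents"
            / "Family Historian Projects" / project
            / f"{project}.fh_data" / "Plugin Data")


def nondups_file(user: str, project: str) -> PureWindowsPath:
    """Path of the Omit Non-Duplicates list file named in the post above."""
    return plugin_data_dir(user, project) / "Find Duplicate Individuals.nondups"
```

For example, `nondups_file("john", "Morrison")` points at the `.nondups` file inside the `Morrison.fh_data` folder of a Project called Morrison.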

RogerF
Famous
Posts: 182
Joined: 26 Apr 2009 16:32
Family Historian: V6.2
Location: Oxfordshire, England

Post by RogerF » 22 Oct 2012 13:48

Following up on Jane's idea of handling first-name variants (eg Bill = Billy = Will = Willy = William = Willm.), an alternative approach might be to harness the power of FHUG itself and create our own variant database. Even if only 1% of our members participated, that's still 50 people -- enough, surely, to populate an initial table which can then be enhanced over time with member submissions. Just a thought.
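
As a very rough idea of what such a table could look like in use, here is a Python sketch. The structure and function names are invented for illustration, and the only data in it are the example variants already mentioned in this thread.

```python
def _key(name: str) -> str:
    """Fold case and drop a trailing full stop, so 'Willm.' becomes 'willm'."""
    return name.strip().rstrip(".").casefold()


def build_variant_index(groups: list[list[str]]) -> dict[str, str]:
    """Map every variant to the first name in its group."""
    index = {}
    for group in groups:
        canonical = _key(group[0])
        for name in group:
            index[_key(name)] = canonical
    return index


def same_forename(a: str, b: str, index: dict[str, str]) -> bool:
    """True if two forenames are identical or belong to the same group."""
    ka, kb = _key(a), _key(b)
    return index.get(ka, ka) == index.get(kb, kb)


VARIANTS = build_variant_index([
    ["William", "Bill", "Billy", "Will", "Willy", "Willm."],
    ["Helen", "Nellie"],
])
```

Member submissions would then just be extra groups added to the list, with no change to the matching code.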
Roger Firth, using FH to research the FIRTHs of Lancashire and Yorkshire, and the residents of the market town where I live.

Post by johnmorrisoniom » 22 Oct 2012 17:50

Hi Mike, thanks for the help. Non Duplicates successfully recovered.

When I choose display previous result set, it does only take a few seconds to display.

However, and I have found this with other plugins with a progress bar, after the plugin has completed and the progress bar goes away, there can be quite a wait while the result set is created.

Because I restored the plugin data folder, the result set had reverted as well, so I am currently running the plugin with 31967 individuals, and I will try and time the gap if I can.

Could it be possible in a future version to have a negative value for place part wrong, rather than just a low positive one? (I have set it to zero for now.)

It is still one of several superb plugins that have proved immensely useful, especially on a large file.

Post by Jane » 22 Oct 2012 17:55

John just a thought, what version of V5 are you using?
Jane

Post by johnmorrisoniom » 22 Oct 2012 17:58

I'm on 5.0.7.
I always like to be up to date.

Post by tatewise » 24 Oct 2012 00:29

The Find Duplicate Individuals Version 2.4 is now available for download.

Please give it a whirl!

The option to redisplay the previous Result Set without rerunning assessments is improved.

Omit Non-Duplicates tab is faster with a large Result Set.

Help & Advice is now structured, but needs some work for the Set Preferences tab.

The Progress Bar now only updates for each 1% step, so for very long running assessments may not change for many seconds.
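
The idea behind the 1% step is just to skip redraws that would not change what is displayed. A minimal Python sketch of the technique follows; it is not the Plugin's Lua code, and the class name and `redraw` callback are hypothetical.

```python
from typing import Callable


class ProgressReporter:
    """Calls redraw(percent) only when the whole-number percentage increases."""

    def __init__(self, total: int, redraw: Callable[[int], None]):
        self.total = max(total, 1)   # guard against division by zero
        self.redraw = redraw
        self.last_percent = 0

    def update(self, done: int) -> None:
        percent = done * 100 // self.total
        if percent > self.last_percent:
            self.last_percent = percent
            self.redraw(percent)
```

With 31960 records that means about 100 redraws rather than one per record, which is also why the bar can sit still for many seconds on a long run.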

Several other minor improvements.