Find Duplicate Individuals Version 1.5+
Posted: 15 Jul 2012 20:51
by BillH
Mike,
I'm still getting a lot of non-duplicates on my report. Many of these are high on my report mixed in with the actual duplicates. Most of these would be eliminated or dropped way down if the mis-matched parents' names didn't result in so many points.
Yes, that is correct. If both Names exist, and their matches result in zero points, then the deduction is applied. You can influence this with values of IntLastNameScore, IntForeNameRight and IntSoundexNames, which can be zero, and IntForeNameWrong that can even be negative.
Yes OK, I will add separate IntNamesDeductIndi=0, IntNamesDeductFath=-5, IntNamesDeductMoth=-5, IntNamesDeductSpou=0, IntNamesDeductChld=0 in the next Version.
If the new options you are adding work the same way, then will it really help all that much?
Since IntLastNameScore, IntForeNameRight, IntSoundexNames and IntForeNameWrong apply to all name matches, if I set them to zero won't that also impact matching of names for the individuals themselves and for their spouses and children?
I would really like to leave name matching alone for the individuals, their spouses, and children and only impact mothers' and fathers' names if possible.
Ideally, if the mothers' and fathers' surnames are not an exact match, I would deduct so many points that the individuals ended up with no points or negative points.
If the surnames were an exact match, then I would deduct some points if the forenames did not match.
What might work is options something like this:
IntNamesDeductFathSurname
IntNamesDeductFathForenames
IntNamesDeductMothSurname
IntNamesDeductMothForenames
Rather than these working only if the names are a total mismatch, they would work if the names were not a total match. So... we would be able to deduct points if one mother was named Julia Ann Snodgrass and one was named Ann Marie Henshaw for example. The common Ann should not prevent points from being deducted.
Would something like that be possible?
Thanks,
Bill
Find Duplicate Individuals Version 1.5+
Posted: 15 Jul 2012 22:56
by tatewise
I think it would be possible to have a complete set of Name points for each Relative.
e.g.
IntIndiDeduction
IntIndiMaximum
IntIndiLastName
IntIndiForeRight
IntIndiForeWrong
IntIndiSoundex
IntFathDeduction
IntFathMaximum
IntFathLastName
IntFathForeRight
IntFathForeWrong
IntFathSoundex
and similarly for Moth, Spou and Chil.
Find Duplicate Individuals Version 1.5+
Posted: 15 Jul 2012 23:20
by BillH
Mike,
Of course if that could be done, that would be ideal.[smile]
Can this be set up so that the values can be negative (not just 0) so that deductions can be made if either the forenames are not exact matches or the surname is not an exact match (rather than only being able to deduct points if the entire name score is 0)?
Thanks!
Bill
Find Duplicate Individuals Version 1.5+
Posted: 16 Jul 2012 12:10
by tatewise
Bill ~ What if there was an IntFathMinimum threshold instead of zero, below which the IntFathDeduction would apply?
Thus, after all the existing Forenames and Surnames and Soundex scoring, the following would apply:
If the score <= IntFathMinimum (default 0) then the score becomes IntFathDeduction.
If the score >= IntFathMaximum (default 20) then the score becomes IntFathMaximum as now.
So, since a Father's Surname mismatch scores zero, and any Forename and Soundex matches could be say 2 points each, then an IntFathMinimum of say 6 points would result in the IntFathDeduction if there were three or fewer matches.
The same scheme would apply for each Relative.
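The Minimum/Maximum/Deduction rule proposed here can be sketched in a few lines. This is only an illustration in Python, not the plugin's own code: the function name `apply_name_limits` is made up, and the defaults are the ones mentioned in this thread (Minimum 0, Maximum 20, Father deduction -5).

```python
def apply_name_limits(score, minimum=0, maximum=20, deduction=-5):
    """Apply the Minimum/Maximum/Deduction rule to one relative's Names score.

    Hypothetical helper: the parameters stand in for settings such as
    IntFathMinimum, IntFathMaximum and IntFathDeduction.
    """
    if score <= minimum:
        # At or below the threshold: the Names score is replaced by the deduction.
        return deduction
    if score >= maximum:
        # Cap the score so multiple Name matches cannot swamp the results.
        return maximum
    return score


# Mike's example: IntFathMinimum = 6 with 2 points per Forename/Soundex match.
print(apply_name_limits(6, minimum=6))  # three matches, 6 points
print(apply_name_limits(8, minimum=6))  # four matches, 8 points
print(apply_name_limits(25))            # above the Maximum
```

With these inputs the three calls give -5, 8 and 20: three or fewer matches fall into the deduction, four matches keep their points, and anything over the Maximum is capped.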
Lorna ~ I am looking at making the Chronology Mismatch Deduction proportional to the discrepancy.
So the more that Chronology Date Checks differ, then the greater the points deducted.
Thus a discrepancy of up to 10 years would only deduct 1 point, whereas a discrepancy of 90 to 100 years would deduct 10 points, i.e. 1 point per decade.
The years per point would be a User Preference Setting.
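As a rough illustration of the proportional deduction, here is a small Python sketch. It assumes the discrepancy is rounded up to the next whole point (so that anything up to 10 years costs 1 point and 90 to 100 years costs 10); the function name `chronology_deduction` and its parameter names are hypothetical, not taken from the plugin.

```python
import math


def chronology_deduction(discrepancy_years, years_per_point=10):
    """Points to deduct for a chronology discrepancy, one point per block of years.

    Assumption: partial blocks are rounded up, matching the
    'up to 10 years deducts 1 point' wording in the post.
    """
    if discrepancy_years <= 0:
        return 0  # no discrepancy, nothing to deduct
    return math.ceil(discrepancy_years / years_per_point)


print(chronology_deduction(5))    # within the first decade
print(chronology_deduction(95))   # between 90 and 100 years
print(chronology_deduction(30, years_per_point=5))  # user changes years per point
```

The first two calls give 1 and 10 points, and the last gives 6, showing how a smaller years-per-point setting makes the same discrepancy cost more.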
Find Duplicate Individuals Version 1.5+
Posted: 16 Jul 2012 17:43
by BillH
Mike,
I think that suggestion would work well if the father's surnames don't match, but not as well if the surnames do match.
I have a lot of individuals on my list where the father's surnames match, but the forenames don't match exactly. They are getting as many as 20 points under the column 'Father'. In one example the fathers are 'Thomas P Inshaw' and 'Edwin Thomas Inshaw'. For this same pair, the mother's names are 'Sarah Ann' and 'Ada'. These two are obviously not a match, yet they get a total of 35 points and are the 4th highest pair on my report ahead of some actual duplicates.
If I set IntFathMinimum to something higher than 20 to eliminate this pair, then I think I'd also eliminate some of my actual duplicates. Maybe I'm just confused and you can correct me on this?
I think, if it is possible to code, my earlier suggestion might be a way to handle it. If the mother's and father's surnames are not an exact match, allow a deduction of points. If the surnames are an exact match, but the forenames are not an exact match, allow a deduction of points.
Thanks for continuing to pursue this!
Bill
Find Duplicate Individuals Version 1.5+
Posted: 16 Jul 2012 19:04
by tatewise
I find it difficult to understand the scores you quote with only partial information.
With the current V1.7 default scoring:-
Thomas P Inshaw v. Edwin Thomas Inshaw should score 10 points = 7 Surname + 3 Forename wrong position.
Plus other points for Event matches to get Father column score.
Sarah Ann v. Ada should mismatch and score 0 points in Mother column.
No points should be added even if Events match.
So I am not clear where the total of 35 is coming from.
Remember, each Father, Mother, etc column is made up of 5 component scores; the Names match, and four Event matches.
The Names match must score half of IntNamesMatched before any Event points are added. (I forgot to mention this in the WiP Help page.)
So in above Father Names match, if Forename in wrong position = 0 points, and IntFathMinimum were set to 8 points, then IntFathDeduction would apply and no Event scores added.
Having said all that, you clearly 'know' the above Individual pair & Parents are NOT duplicates, but from my independent stand-point they look feasible 'fuzzy' Name matches, that with corroborating Events, for these Parents, or the Individual, or other Relatives makes them possible candidates.
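To make the Names arithmetic for this Father pair easier to follow, here is a simplified Python sketch. It is an assumption-laden illustration, not the plugin's algorithm: the last word of each name is treated as the Surname, Soundex matching is left out entirely, and `score_names` is a made-up function. The Surname (7) and wrong-position (3) points come from the V1.7 defaults quoted above; the right-position value of 6 is just a placeholder.

```python
def score_names(name_a, name_b, last_name=7, fore_right=6, fore_wrong=3):
    """Score two names: Surname match plus Forename matches by position.

    Simplified sketch: no Soundex, last word is assumed to be the Surname.
    """
    *fores_a, last_a = name_a.lower().split()
    *fores_b, last_b = name_b.lower().split()

    score = last_name if last_a == last_b else 0
    for position, forename in enumerate(fores_a):
        if position < len(fores_b) and fores_b[position] == forename:
            score += fore_right   # same Forename in the same position
        elif forename in fores_b:
            score += fore_wrong   # same Forename but in a different position
    return score


print(score_names("Thomas P Inshaw", "Edwin Thomas Inshaw"))
print(score_names("Thomas P Inshaw", "Edwin Thomas Inshaw", fore_wrong=0))
```

The first call gives 10 (7 Surname + 3 for 'Thomas' in the wrong position). Setting the wrong-position points to 0 drops it to 7, which is below a Minimum of 8, so the deduction would then apply.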
Find Duplicate Individuals Version 1.5+
Posted: 16 Jul 2012 20:08
by LornaCraig
tatewise wrote:
"I am looking at making the Chronology Mismatch Deduction proportional to the discrepancy.
So the more that Chronology Date Checks differ, then the greater the points deducted.
Thus a discrepancy of up to 10 years would only deduct 1 point, whereas a discrepancy of 90 to 100 years would deduct 10 points, i.e. 1 point per decade.
The years per point would be a User Preference Setting."
Mike,
This sounds like a good idea, but would it be in addition to, or an alternative to:
If more than 2 Chronology checks fail then the pair is excluded from the results.
The latter, currently in force, is definitely a good idea. But if you introduced the 'proportional' chronology mismatch how many points would need to be deducted for it to count as 'failing' the chronology test? I suppose that would have to be another User Preference Setting.
Find Duplicate Individuals Version 1.5+
Posted: 16 Jul 2012 21:48
by BillH
Mike,
Sorry... I didn't give you enough details. 13 points were for the name match on the individuals themselves, 2 points were for the individuals' birth fact, and 20 under the Father total (I guess that is 10 points for the name and 10 points for the birth fact). Really the only point I was trying to make (although I did it poorly) was that out of the 35 total points, 20 of them were under the Father column. With your suggested options of IntFathDeduction, IntFathMaximum, IntFathLastName, IntFathForeRight, IntFathForeWrong, and IntFathSoundex I think I could adjust these to make the father have less of an influence on the total.
Thanks for the explanation on how the name match must score half of IntNamesMatched before any event points are added.
I must be too close to the data. To me, two individuals with different names who have fathers with different names and mothers with different names are probably not a match. I am just trying to find a way to get these types of pairs to not be at the top of my list hiding the real matches which usually have less points.
When your next version comes out, I'll try playing around with the new options and see what I come up with.
Thanks,
Bill
Find Duplicate Individuals Version 1.5+
Posted: 17 Jul 2012 09:25
by johnmorrisoniom
Mike,
I have run vers 1.7 on an older copy of my file (Before I started using the duplicates plugin) and am currently working through the data to compile a spreadsheet for you.
I am going to use this file for all future testing, so that I can gauge the results better. I should finish the spreadsheet tomorrow and will email it to you.
Find Duplicate Individuals Version 1.5+
Posted: 18 Jul 2012 00:25
by tatewise
The Find Duplicate Individuals Version 1.8 is now available for download.
It incorporates many of your recent suggestions ~ see the WiP Help page for details.
It now allows the previous Result Set to be redisplayed at any time, so you can continue working on it without reassessing your database.
With Enable Diagnostic Mode you may optionally select Including Timespan Dates.
All the User Preference Settings for Name Matching have a separate set for Individual, Father, Mother, Spouse, and Child.
The Deduction, Minimum and Maximum are as discussed above, plus a Threshold needed to proceed with Event Assessments, etc.
Event Assessment also now has a Minimum needed to avoid entire Event mismatch.
The Timespan to extend 'After', 'Before', 'From', 'To' now defaults to 50 Years instead of 10 Years.
Chronology proportional scoring deducts 1 point for each Year of discrepancy.
If more than 20 points are deducted then the Individuals are excluded.
You can set limits on lowest value and maximum rows to display in the Result Set.
Find Duplicate Individuals Version 1.5+
Posted: 18 Jul 2012 04:29
by BillH
Mike,
I tried version 1.8, but I am getting results that I don't understand.
tatewise wrote:
"If the score <= IntFathMinimum (default 0) then the score becomes IntFathDeduction.
Thomas P Inshaw v. Edwin Thomas Inshaw should score 10 points = 7 Surname + 3 Forename wrong position.
Plus other points for Event matches to get Father column score.
So in above Father Names match, if Forename in wrong position = 0 points, and IntFathMinimum were set to 8 points, then IntFathDeduction would apply and no Event scores added."
I am looking at the same pair. The fathers names are as listed above. I used the following settings for the Father Name Match Settings, but I am seeing 21 points in the Father column.
IntFathDeduction = -5
IntFathMinimum = 10
IntFathMaximum = 20
IntFathThreshold = 5
IntFathLastName = 7
IntFathForeRight = 6
IntFathForeWrong = 0
IntFathSoundex = 2
I would have thought I would have 7 points for the name. This is less than the 10 points that I have for IntFathMinimum so I would not have expected to see a total of 21 points under Father (I thought no event points would have been counted). So I would have expected to end up with -5 points under the Father column.
Did I misunderstand?
Thanks,
Bill
Find Duplicate Individuals Version 1.5+
Posted: 18 Jul 2012 13:05
by tatewise
The Find Duplicate Individuals Version 1.9 is now available for download to fix the bug that Bill just highlighted ~ Sorry Bill ~ My mistake!
See the WiP Help page for details.
Bill ~ In your particular example, I agree the match should only score 7, so IntFathMinimum can now be as low as 8, and still yield the -5 points deduction.
Find Duplicate Individuals Version 1.5+
Posted: 18 Jul 2012 15:39
by Valkrider
Mike
Is the plug-in supposed to look at places as well as names? If it is then it doesn't seem to be as I am getting Middlesex births being shown as duplicates with Lincolnshire births for instance with a score of ~12.
This latest version seems to be showing a lot more possibles with scores of as low as 7 than the 1.6 version. Just FYI
Find Duplicate Individuals Version 1.5+
Posted: 18 Jul 2012 15:58
by tatewise
Colin ~ Please read the WiP help page.
Select the Diagnostic Mode and redisplay the Result Set to see how scores are broken down.
Find Duplicate Individuals Version 1.5+
Posted: 18 Jul 2012 17:00
by Valkrider
Mike
I have but I don't see any heading for scoring about birth place just for ibirth. Am I missing something? Surely date and location should be regarded as 2 separate entities.
Find Duplicate Individuals Version 1.5+
Posted: 18 Jul 2012 18:00
by BillH
Mike,
Version 1.9 took care of the problem I reported. It is now working great. Thanks for the quick fix!
The results are now looking much better. I've actually found some duplicates that were hidden by all the non-duplicates before.
I am seeing one thing that I can't figure out how to handle. I am getting a lot of pairs where the surnames are different, but the forenames match and the forenames are two parts like 'Margaret Ann' or 'Mary Elizabeth'. These folks end up with enough points to get high up my list, but they aren't duplicates. Is there some way to handle these that I'm missing?
Thanks,
Bill
Find Duplicate Individuals Version 1.5+
Posted: 18 Jul 2012 20:23
by LornaCraig
Like Bill, I am seeing a number of individuals being matched because of identical pairs of forenames. For example Mary Ann was a very common combination, often used as if it were a single name (some Mary Anns are occasionally transcribed Marianne). There is no obvious solution to this, as it is important to take second forenames into account: they can be key to matching or distinguishing people. Another source of unexpected scores for name matching is the Alt name field. I use this to record alternative spellings of the same surname, so if two individuals have the same alternative spellings of what is in fact just one name, it artificially boosts the score. Again, there is probably no way round this.
I am not sure about the new progressive scoring for chronology mismatches. On one level I feel that any chronological incompatibility should mean that the pair are removed from the list. If one person was buried in April 1780 and another was baptised in November 1780 they cannot be the same person. So it is tempting to reduce the Degree of mismatch tolerated before excluding Individuals to 1 point. However this leaves no room for clerical error in either an original document or a modern transcription, so it could be risky.
I know the points for a chronological mismatch can be adjusted to more than 1 year per point, but can they be changed to more than 1 point per year? Could 5 or 10 points be deducted per year? This would move the pair further down the list without removing them entirely. It would also bring the score for a close chronological mismatch more into line with the default 15 points deduction for an entire event mismatch. It doesn't seem right to deduct 15 points for an event mismatch (say two baptisms on specific dates six months apart) but as little as 1 point for a chronology mismatch. After all, one person being buried before the other was baptised implies a double event mismatch: their baptisms and burials must both mismatch.
Having said that, the plugin is working well and has helped me establish that I have no definite duplicates and only a few possibles. There might be some refinements that could help people with very large databases but it is already more than adequate for mine (approx 3500 individuals). Thank you.
Find Duplicate Individuals Version 1.5+
Posted: 18 Jul 2012 22:04
by tatewise
Colin ~ If Event Dates do not match (within the tolerances defined), then Places are not checked. I took this view because Places for family events tend to recur a lot, and scoring points for matching Places when Dates don't agree would give them too much importance. See the WiP Help on Event Assessment for details. You can also make some adjustment to the User Preference Settings for Event Assessment at the head of the Plugin script.
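The 'Dates first, Places second' rule can be pictured with a short Python sketch. Everything specific in it is hypothetical: the function name `score_event`, the 2-year tolerance, and the 5 points each for Date and Place are placeholders, and dates are reduced to plain years. The real settings and point values are in the WiP Help and the Plugin script, and any mismatch deduction is not modelled here.

```python
def score_event(year_a, year_b, place_a, place_b,
                tolerance_years=2, date_points=5, place_points=5):
    """Score one Event pair: Places only count when the Dates agree.

    All defaults are placeholder values for illustration only.
    """
    if abs(year_a - year_b) > tolerance_years:
        # Dates disagree: the Places are never looked at.
        return 0
    score = date_points
    if place_a == place_b:
        score += place_points
    return score


# Same birth year, different counties: Date points only.
print(score_event(1850, 1850, "Middlesex", "Lincolnshire"))
# Same Place but Dates 20 years apart: nothing scored.
print(score_event(1850, 1870, "Middlesex", "Middlesex"))
```

Under these placeholder values the first pair scores 5 and the second scores 0, which is why a Middlesex birth and a Lincolnshire birth can still pick up points when the Dates agree.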
The latest versions are letting a few more lower scores through to the results, but this can be adjusted in the User Preference Settings for the User Interface at the head of the Plugin script.
Bill & Lorna ~ Regarding multiple Name matches, the only thing you can do is reduce the IntXxxxForeRight points for a Forename in the right position, and reduce the IntXxxxMaximum points that was introduced way back to prevent swamping the results with multiple Name matches.
Lorna ~ I had similar arguments with myself about Chronology scoring. Like you, I want to allow for clerical errors. I assumed that if there is one Chronology error, then there are probably others, and each one deducts points. Also, Event Assessment is only performed on Real Dates, whereas Chronology checks often use Synthetic Dates that are derived from estimates for Lifespan, etc. What I wanted to do was exclude gross Chronology errors of many years, and let the others influence the results, but the balance may need improving.
REMEMBER NO SCORING SYSTEM WILL BE PERFECT
That is why I think the next version may include the Non-Duplicates Management feature discussed earlier to deal with all these anomalies.
Find Duplicate Individuals Version 1.5+
Posted: 19 Jul 2012 00:01
by BillH
Mike,
I really do like this plugin. I think you have done an amazing job in a very short time. I have already found 8 or 9 duplicates and I didn't think that I had any.
I won't belabor this point and promise not to bring it up again. It would really be nice if we could have an option to allow us to deduct points if the surname doesn't match for the Father and Mother. However, I do understand that not everything is possible. I'll happily use the plugin as it is if that isn't possible.
Thanks for all your hard work!
Bill
Find Duplicate Individuals Version 1.5+
Posted: 20 Jul 2012 00:46
by tatewise
Bill ~ Now that I have got most of the other stuff out of the way, I might be able to offer what you ask, without it having too big a run-time impact.
The reason it is run-time sensitive, is that Name matching is performed more often than any other checks.
Just to confirm, are you requesting one optional Deduction when BOTH (1) the Fathers' Surnames have no match AND (2) the Mothers' Surnames have no match?
OR do you mean an optional Deduction when the Fathers' Surnames have no match,
AND a separate optional Deduction when the Mothers' Surnames have no match?
For my proposed scheme to work, the only condition is that the points for IntFathLastName must be different from all of IntFathForeRight & IntFathForeWrong & IntFathSoundex.
A similar condition applies to IntMothLastName.
FYI:
I have implemented a Non-Duplicates Management feature for the next version, along the lines discussed earlier.
It allows any pair of Individuals in the Result Set to be added to a list of pairs of Non-Duplicates, so those pairs will be excluded in future.
Entries in the Non-Duplicates list may also be removed at any time.
The only condition is that Individual Record Id must not be changed, otherwise any saved Result Set and Non-Duplicates lists will be invalidated.
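A Non-Duplicates list of this kind could be kept as a set of unordered Record Id pairs. The Python sketch below is only a guess at the idea, not how the plugin stores its list: `pair_key`, `filter_results` and the `(id_a, id_b, score)` row layout are all invented for the example.

```python
def pair_key(id_a, id_b):
    """Unordered key for a pair of Record Ids, so (12, 57) equals (57, 12)."""
    return frozenset((id_a, id_b))


def filter_results(result_set, non_duplicates):
    """Drop every Result Set row whose pair is in the Non-Duplicates list."""
    return [row for row in result_set
            if pair_key(row[0], row[1]) not in non_duplicates]


# Hypothetical Result Set rows: (record id A, record id B, score).
result_set = [(12, 57, 35), (3, 90, 28), (57, 12, 35), (8, 44, 19)]

non_duplicates = set()
non_duplicates.add(pair_key(12, 57))      # mark pair as Non-Duplicates
print(filter_results(result_set, non_duplicates))

non_duplicates.discard(pair_key(12, 57))  # entries can be removed again
print(len(filter_results(result_set, non_duplicates)))
```

Because the key is built from the Record Ids, renumbering the records would make the stored pairs point at the wrong people, which is the condition mentioned above.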
Find Duplicate Individuals Version 1.5+
Posted: 20 Jul 2012 01:38
by BillH
tatewise wrote:
"OR do you mean an optional Deduction when the Fathers' Surnames have no match,
AND a separate optional Deduction when the Mothers' Surnames have no match?"
This is what I was hoping for. Two options. One if the fathers' surnames are not an exact match. A separate one if the mothers' surnames are not an exact match.
tatewise wrote:
"For my proposed scheme to work, the only condition is that the points for IntFathLastName must be different from all of IntFathForeRight & IntFathForeWrong & IntFathSoundex."
This should not be a problem, as I have more points going to IntFathLastName than any of the other options. Same for IntMothLastName.
Thanks!
Bill
Find Duplicate Individuals Version 1.5+
Posted: 20 Jul 2012 11:04
by tatewise
OK, delving into the details a bit deeper throws up these questions.
Let's call the deduction points setting IntFathLastError.
(1) What happens to points associated with IntFathForeRight & IntFathForeWrong & IntFathSoundex?
When IntFathLastError deduction applies, are any points accumulated for Forename and Soundex matches cancelled for that Father v Father Name comparison? i.e. The score will simply be the IntFathLastError deduction.
OR
When IntFathLastError deduction applies, are any points accumulated for Forename and Soundex matches retained, and IntFathLastError deducted from the running total?
(2) In this latter case, what happens if the resulting total is less than IntFathMinimum? ~ Does IntFathDeduction then apply instead?
(3) A different scheme would be to have a TRUE/FALSE flag for the LastName mismatch optional setting.
If this flag were TRUE, as you would have it for Fathers, then if there was a complete Father's LastName mismatch then IntFathDeduction would be the resultant score.
Find Duplicate Individuals Version 1.5+
Posted: 20 Jul 2012 17:25
by BillH
Mike,
tatewise wrote:
"(1) When IntFathLastError deduction applies, are any points accumulated for Forename and Soundex matches cancelled for that Father v Father Name comparison? i.e. The score will simply be the IntFathLastError deduction."
This is how I was thinking it could work.
tatewise wrote:
"(2) In this latter case, what happens if the resulting total is less than IntFathMinimum? ~ Does IntFathDeduction then apply instead?"
Good question. I guess I was thinking we would use the IntFathLastError value in this case. I would make this value large enough to make sure the total value for Father is 0 or negative. I think that using IntFathDeduction would also work, but maybe having two options would give people more control. Someone may want a large deduction for last name total mismatches, but a smaller deduction for other mismatches.
tatewise wrote:
"(3) A different scheme would be to have a TRUE/FALSE flag..."
See #2.
Either way will work for me.
Thanks,
Bill
Find Duplicate Individuals Version 1.5+
Posted: 21 Jul 2012 21:44
by tatewise
The Find Duplicate Individuals Version 2.0 is now available for download.
The main new feature is the Omit Non-Duplicates tab. This allows pairs of Candidate Duplicates from the Result Set, that have been determined as Non-Duplicates, to be added to a list of Non-Duplicates for future exclusion. The user interface turned out to be easier to implement than I expected. It is still a bit raw, and may need refinement in the future.
Another change is a new IntXxxxLastWrong user setting, as requested by Bill, where Xxxx is any relative such as Father or Mother. If set non-zero, then that number of points is deducted if the Lastnames do not match, regardless of any Forename or Soundex matches. The default value of zero disables this feature.
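One possible reading of the IntXxxxLastWrong setting is sketched below in Python. It assumes that when the Lastnames mismatch and the setting is non-zero, the Names score for that relative simply becomes the deduction, with any Forename or Soundex points ignored (option (1) in the earlier discussion). The post does not spell out the exact arithmetic, so treat this as an assumption; `relative_name_score` and its parameters are hypothetical names.

```python
def relative_name_score(last_names_match, fore_and_soundex_points,
                        last_name=7, last_wrong=0):
    """Names score for one relative with an optional Lastname mismatch deduction.

    Assumed behaviour: a non-zero last_wrong overrides everything else
    when the Lastnames do not match; zero leaves scoring unchanged.
    """
    if not last_names_match and last_wrong != 0:
        # Deduction applies regardless of Forename/Soundex matches.
        return -abs(last_wrong)
    surname_points = last_name if last_names_match else 0
    return surname_points + fore_and_soundex_points


# 'Mary Ann' forenames match (8 points, say) but the Lastnames differ.
print(relative_name_score(False, 8))                 # feature disabled
print(relative_name_score(False, 8, last_wrong=15))  # feature enabled
```

With the feature disabled the pair keeps its 8 Forename points; with a value of 15 the score becomes -15, which would push such pairs well down the list.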
The IntChronMagnitude user setting is now in Months instead of Years as suggested by Lorna's comments. The default is 12 Months, i.e. 1 Year as it was before, but by setting it to say 2 Months, then 1 Point would be deducted per bi-monthly chronology error, i.e. 6 Points per annum.
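Here is a small Python sketch of how a months-per-point setting might combine with the exclusion limit. It is an illustration under assumptions: partial periods are rounded up, the limit of 20 points is the figure from the Version 1.8 announcement, and `chronology_check` with its parameter names is made up rather than taken from the plugin.

```python
import math


def chronology_check(discrepancy_months, months_per_point=12, exclusion_limit=20):
    """Return (points deducted, excluded?) for one chronology discrepancy.

    months_per_point plays the role of IntChronMagnitude; rounding up
    and the exclusion_limit default are assumptions for this sketch.
    """
    if discrepancy_months <= 0:
        return 0, False
    points = math.ceil(discrepancy_months / months_per_point)
    return points, points > exclusion_limit


# Lorna's case: buried April 1780, baptised November 1780 (7 months apart).
print(chronology_check(7))                       # default 12 Months per point
print(chronology_check(7, months_per_point=2))   # 2 Months per point
print(chronology_check(12, months_per_point=2))  # one full year at 2 Months per point
```

At the default the 7-month clash costs 1 point; at 2 Months per point it costs 4, and a full year costs 6, matching the '6 Points per annum' figure. None of these reach the exclusion limit, so the pair stays in the results but moves down the list.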
Find Duplicate Individuals Version 1.5+
Posted: 23 Jul 2012 18:00
by mikegscoles
I must be missing something.
I added myself and DOB as a child of my father to see if the plug in would show my name as a duplicate as I am already the child of my father in my tree. It didn't. So I changed the spelling slightly - it still didn't show. I saved my duplicate addition in case that made a difference - it didn't.
Is there a simple explanation?