* Find Duplicate Place Names
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Find Duplicate Place Names
Right, I feel as though I'm stood on top of a 30 metre diving board with only 1 metre of water below me!
Find Duplicate Place Names plugin
These are the thoughts buzzing round my head at the moment
Note: I have used the diamond symbol ◊ to represent a space for clarity
Aim
Family Historian doesn’t allow duplicate Place Names.
However, if we try to tidy malformed Place Names, this may result in duplicates.
The aim is to identify Place Names which, after tidying, would duplicate an existing Place Name.
e.g.
“Salford,◊◊Lancashire” held in Place Record Id [403] – Has 2 adjacent spaces before Lancashire
“Salford,◊Lancashire” held in Place Record Id [167] – No problems with this entry
However, if we tidied the first entry then it would create a duplicate.
Step 1 – Define the criteria to identify a candidate for tidying
My initial criteria would be:
Multiple adjacent spaces . . . . . e.g. ‘Bradford,◊◊Yorkshire, England’
Multiple adjacent commas* . . e.g. ‘Bradford,, Yorkshire, England’
Leading space . . . . . . . . . . . . . e.g. ‘◊Bradford, Yorkshire, England’
Leading comma* . . . . . . . . . . . e.g. ‘,Bradford, Yorkshire, England’
Trailing space . . . . . . . . . . . . . . e.g. ‘Bradford, Yorkshire, England◊‘
Trailing comma . . . . . . . . . . . . e.g. ‘Bradford, Yorkshire, England,’
Space preceding a comma . . . e.g. ‘Bradford◊, Yorkshire, England’
*It is possible that a user may intentionally use extra commas to force positioning
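As a rough illustration of the criteria above, here is a minimal sketch in plain Lua (no Family Historian functions). The helper name findProblems and the wording of the labels are hypothetical, and it only reports which criteria a name breaks; it does not decide whether extra commas were intentional.

```lua
-- Hypothetical helper: lists which of the Step 1 criteria a place name breaks.
-- Each check is a label plus a Lua pattern (^ anchors the start, $ anchors the end).
local function findProblems(strPlace)
  local tblChecks = {
    { "multiple adjacent spaces", "  " },
    { "multiple adjacent commas", ",," },
    { "leading space",            "^ " },
    { "leading comma",            "^," },
    { "trailing space",           " $" },
    { "trailing comma",           ",$" },
    { "space preceding a comma",  " ," },
  }
  local tblFound = {}
  for _, tblCheck in ipairs(tblChecks) do
    if strPlace:find(tblCheck[2]) then
      table.insert(tblFound, tblCheck[1])
    end
  end
  return tblFound
end

-- Usage: print the problems found for a few of the example names
for _, strPlace in ipairs({ "Bradford,  Yorkshire, England", ",Bradford, Yorkshire, England", "Bradford, Yorkshire, England" }) do
  print("'" .. strPlace .. "'", table.concat(findProblems(strPlace), "; "))
end
```

An empty result means the name passes all seven checks.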
Step 2 – Create a list of all Places (placelist)
Do I need to step through the GedCom or can I access FH’s internal list?
My initial thought was that fhGetDataList("PLACES") would work but whilst this returns a list of Places, it doesn’t return the record Id.
Do I need the record ID? It would be useful.
Step 3 – Identify any Place that matches the criteria shown in Step 1
Add offending Places to a list
Search through placelist and look for any offending places
Could use the Lua equivalent of Grep?
Copy the record into a list invalidlist
Remove from placelist any entries described as invalid
This means that at the end, placelist will contain only valid Places
Step 4 - Tidy the Places that need tidying
This needs to be done based on the invalid list created in Step 3
Step through invalidlist and correct the errors
Could use the Lua equivalent of Grep?
At this point we have 2 lists
placelist – contains valid Places
invalidlist – contains invalid Places that have now been corrected
Step 5 – Compare the tidied list with existing Places to see if there are duplicates
Compare all entries in invalidlist with placelist
If there are any matches, then these need to be identified
Need to return the Place name and Id from invalidlist and the matching values in placelist
It would also be useful to return those Places identified as invalid but that don’t clash with valid Places. This would allow the user the opportunity to correct them.
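On the "Lua equivalent of Grep" question: Lua's string.gsub with patterns can probably do the tidying in Step 4. The following is only a sketch in plain Lua; the function name tidyPlace is hypothetical, and it assumes that empty parts between commas should be dropped, which will not suit users who add commas on purpose.

```lua
-- Hypothetical tidy routine built from string.gsub calls.
-- Assumption: empty comma-separated parts are unwanted and are removed.
local function tidyPlace(strPlace)
  local strTidy = strPlace
  strTidy = strTidy:gsub("%s+", " ")     -- collapse runs of whitespace to one space
  strTidy = strTidy:gsub(" ?, ?", ",")   -- drop the spaces either side of each comma
  strTidy = strTidy:gsub(",+", ",")      -- collapse runs of commas to one comma
  strTidy = strTidy:gsub("^[ ,]+", "")   -- strip leading spaces and commas
  strTidy = strTidy:gsub("[ ,]+$", "")   -- strip trailing spaces and commas
  strTidy = strTidy:gsub(",", ", ")      -- restore the comma-space separator
  return strTidy
end

print(tidyPlace("Salford,  Lancashire"))            -- Salford, Lancashire
print(tidyPlace(",Bradford,, Yorkshire , England,")) -- Bradford, Yorkshire, England
```

Each assignment keeps only the first value returned by gsub (the new string) and discards the substitution count.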
Dave
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
- tatewise
- Megastar
- Posts: 27080
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: Find Duplicate Place Names
Here are some tips to avoid a fatal belly-flop!
Step 1
Look at the built-in FH =TextPart(...) function.
The example =TextPart(%INDI.BIRT.PLAC%, 1, 0, TIDY) meets all your criteria I think.
The fhCallBuiltInFunction(...) will call that function from within a plugin.
Step 2
You have obviously found the How to Write Plugins > Introduction to Family Historian Plugins.
See the Sample Plugin Scripts that have many useful examples.
The Surname Summary shows how to loop through records and read their name.
Just use "_PLAC" records instead of "INDI" records.
Yes, you will need Record Id and there is a plugin function for that.
Step 3 - Step 5
Lua has a powerful table structure that is often overlooked and replaces dedicated features in other languages.
I suggest using each tidied Place name as the index to a table that holds the pointer to the original record.
If such an index entry already exists when a later record is tidied then bingo, a duplicate has been found and both record pointers are known, can be saved and reported in pairs in a Result Set.
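The table-index idea can be tried outside Family Historian with plain Lua. This is a sketch only: the records are ordinary tables with hypothetical id and name fields standing in for Place record pointers, and simpleTidy is a simplified stand-in for the TextPart tidy that only normalises spacing.

```lua
-- Simplified stand-in for the tidy step (assumption: only spacing needs fixing here)
local function simpleTidy(strName)
  return (strName:gsub("%s+", " "):gsub(" ?, ?", ", "))
end

-- Hypothetical helper: one pass over the records, using the tidied name as the table key
local function findConflicts(tblPlaces)
  local dicSeen = {}        -- tidied name => first record seen with that name
  local tblConflicts = {}   -- list of conflicting pairs
  for _, tblPlace in ipairs(tblPlaces) do
    local strTidy = simpleTidy(tblPlace.name)
    local tblFirst = dicSeen[strTidy]
    if tblFirst then
      -- Key already present, so both records are known and can be reported as a pair
      table.insert(tblConflicts, { tidy = strTidy, first = tblFirst.id, second = tblPlace.id })
    else
      dicSeen[strTidy] = tblPlace
    end
  end
  return tblConflicts
end

local tblConflicts = findConflicts({
  { id = 403, name = "Salford,  Lancashire" },
  { id = 167, name = "Salford, Lancashire" },
  { id = 5,   name = "Bradford, Yorkshire" },
})
for _, tblPair in ipairs(tblConflicts) do
  print(tblPair.tidy, tblPair.first, tblPair.second)
end
```

Because each lookup is a single table access, the work grows with the number of records rather than with the number of pairs of records.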
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: Find Duplicate Place Names
Mike
Thanks for the tips, I'll start on it next week and see how I get on.
Dave
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: Find Duplicate Place Names
I’ve got my prototype running.
It returns 3 columns:
Malformed Placenames – list of malformed place names
Correction – List of what they should be corrected to
Conflicts with – List of corrected place names which would conflict with another, existing place name
e.g.
Malformed Place Names | Correction | Conflicts With
,Bristol, England | Bristol, England | Bristol, England
Florida, USA, | Florida, USA | ""
Kircauldy, Scotland | Kircauldy, Scotland | ""
New Jersey,,, USA | New Jersey, USA | ""
Redland, Bristol , England | Redland, Bristol, England | Redland, Bristol, England
At this stage there are no record links – that’s for the next stage.
It all works as expected.
Unfortunately, I have a speed problem with databases that have a lot of places.
This is a problem of my own making
To check if a Corrected place name will conflict with an existing ‘good’ place name:
I step through the table of corrected place names
then compare each place name with all the place names in the database.
The speed of this is dependent on
a) How many corrections there are
b) How many place names are in the main table.
These are the statistics for three tests:
Errors | Places | Run time
0 | 1,241 | 2 seconds
19 | 4,527 | 15 seconds
494 | 13,240 | 10 minutes
The question
Is there a faster/better way of trying to find a match for a correction when comparing with the main table of place names?
Code: Select all
tblConflicts = {} -- Table to hold placenames where there is a conflict
ii = 0 -- Row counter so conflicted placenames go in the right row to match tblCorrected
for iRec, strPlacename in pairs(tblCorrected) do -- Pick up a place name
    ii = ii + 1 -- Increment row counter
    tblConflicts[ii] = "" -- Pre-populate the row with a blank in case no conflict found
    for strOrigplace in pairs(tblPlacenames) do -- Loop through the keys of tblPlacenames (the originals) to find a match
        if strPlacename == strOrigplace then -- If match found then add to tblConflicts
            tblConflicts[ii] = strOrigplace -- in the matching row
        --else -- Tried this but didn't populate the row
            --tblConflicts[ii] = "" -- so pre-populated at start of run
        end
    end
end
Many thanks
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
- tatewise
- Megastar
- Posts: 27080
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: Find Duplicate Place Names
Excellent.
Yes, there is a very fast way of looking up duplications using a table.
Use the tidied place name as an index key into a table and save the pointer to the Place record as its value.
But if the table index key already has a value then a conflicting duplicate has been found and the details logged.
This algorithm means the Place records are only traversed once.
e.g.
Code: Select all
local dicRec = {} -- Tidied Place name => pointer to the first Place record with that name
local ptrRec = fhNewItemPtr()
ptrRec:MoveToFirstRecord("_PLAC") -- Loop through Place records using ptrRec
while ptrRec:IsNotNull() do
    local strText = fhGetItemText(ptrRec,"~.TEXT")
    local strTidy = fhCallBuiltInFunction("TextPart",strText,1,0,"TIDY")
    local ptrTidy = dicRec[strTidy]
    if ptrTidy then -- Duplicate tidied name found so save Result Set data of both records
        -- < log tidy name (strTidy) and both Place records (ptrTidy & ptrRec) with their Rec Id to Result Set >
    else
        dicRec[strTidy] = ptrRec:Clone() -- Save pointer to Place record against tidied name
    end
    ptrRec:MoveNext()
end
-- < display Result Set >
The difference from your Plugin is that this only lists pairs of conflicting Place records, which I thought was the objective.
I can see that it might be useful to list records whose current name differs from the tidied version.
One snag is that the tidied version purges leading and trailing commas & spaces which might not be wanted.
This part of the exercise, getting the specification of objectives clear, is an important stage in the design process.
If run time is still a potential issue for large Projects then a Progress Bar will be needed.
There is a code snippet for that but you will discover that the script to add that 'user interface' feature will be larger than the basic search algorithm!
There are tips for populating tables that avoid needing a counter and for other table operations but they can wait until later.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: Find Duplicate Place Names
Thanks a lot Mike.
Job for tomorrow
Dave
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: Find Duplicate Place Names
Mike
I've followed your tip for table searching and it works.
10 minutes down to 10 seconds
Would you be prepared to give the plugin a try for me please?
Many thanks
Dave
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
- tatewise
- Megastar
- Posts: 27080
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: Find Duplicate Place Names
Yes, attach it as a file to your posting using the ATTACHMENTS tab and I will review it.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: Find Duplicate Place Names
Hi Mike
Many thanks
Attached are
the plugin - currently called Tidy Place name check
a gedcom - Test FH Sample Project.ged -
.....This is a butchered version with only 16 Places
.....5 of which are malformed
.....and 2 of the 5 would cause conflicts if renamed.
Should return a table with 3 columns
5 entries in cols 1 and 2
3 entries in col 3
Apologies for the copious amount of REM statements
Don't know why, but I couldn't get count = #tblToTidy to work [Also failed with all tables]
Many thanks
Dave
- Attachments
-
Tidy Place name check.fh_lua - The plugin
- (9.37 KiB) Downloaded 115 times
-
Test FH Sample Project.ged - Sample gedcom with place name issues
- (87.05 KiB) Downloaded 104 times
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: Find Duplicate Place Names
I think that I've sussed out why
count = #tblToTidy
didn't work.
After a bit of experimentation it would appear that tables that are created only using the 'Value' and leaving Lua to provide the key e.g.
1 "Peter"
2 "Paul"
3 "Mary"
will return the row count.
Whereas for tables where the user has provided both the Key and the Value, e.g.
Father "Peter"
Brother "Paul"
Mother "Mary"
the row count will not be returned, so I need to loop through the rows and increment a counter.
Dave
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
- tatewise
- Megastar
- Posts: 27080
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: Find Duplicate Place Names
You are close.
Essentially there are two types of table index key:
(1) [non-integer] keys, which form an unordered lookup dictionary that has no # size value.
(2) [integer] keys, which form an ordered list that does have a # size value if the keys start at 1 and are sequential. These can also be sorted on their value and the values get reassigned to the keys. However, integer keys do not have to be sequential.
Lua does create integer keys automatically in some syntax structures, but they are still user-controlled keys.
The Reference Guide gives various examples.
To confuse matters further, a table can have both types of index key.
for key, value in pairs (table) do will traverse both types of index key in an unpredictable order.
for key, value in ipairs (table) do will traverse only integer index keys in ascending order.
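A small plain-Lua sketch of those points, using the names from the earlier example. The helper countEntries is hypothetical; it is just the loop-and-count approach wrapped in a function.

```lua
local tblList  = { "Peter", "Paul", "Mary" }                              -- integer keys 1, 2, 3
local dicRoles = { Father = "Peter", Brother = "Paul", Mother = "Mary" }  -- string keys

-- Hypothetical helper: counts every entry regardless of key type
local function countEntries(tbl)
  local intCount = 0
  for _ in pairs(tbl) do
    intCount = intCount + 1
  end
  return intCount
end

print(#tblList, countEntries(tblList))    -- 3  3
print(#dicRoles, countEntries(dicRoles))  -- 0  3

-- A table with both kinds of key: ipairs visits only the sequential integer keys
local tblMixed = { "Peter", "Paul", Mother = "Mary" }
local tblSeen = {}
for intKey, strValue in ipairs(tblMixed) do
  table.insert(tblSeen, intKey .. "=" .. strValue)
end
print(table.concat(tblSeen, " "))         -- 1=Peter 2=Paul
```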
P.S. I've started reviewing your Plugin but have got distracted.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
Re: Find Duplicate Place Names
Looking good.
given these situations:
' Tenterden, Kent, EN, GBR' leading single space
'Tenterden, Kent, EN, GBR ' ending single space
and
'Tenterden, Kent, EN, GBR, ' ending comma and ending single space
are you handling them as untidy?
FH V.6.2.7 Win 10 64 bit
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: Find Duplicate Place Names
Hi Ron
I haven't trapped for leading or trailing spaces because FH appears to handle them itself.
If you go to Tools > Work with Data > Places
and edit an entry to add leading/trailing spaces, FH will ignore them.
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
Re: Find Duplicate Place Names
I am unsure of that assertion.
But I am on 6.7.2
do the following, on somewhere you can find it again
add the place:
, Valley, Marshall, MN, USA
then on another screen
add the place
, Valley, Marshall, MN, USA
where one is an immediate comma and the other blank comma otherwise exactly alike
do a reverse and sort in wwd>places
you should see two records near each other.
FH V.6.2.7 Win 10 64 bit
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: Find Duplicate Place Names
Hi Ron
On both v6.2 and v7
If I enter the Place for an event with a leading space, as soon as I move to another field you can see FH removing the leading space.
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
Re: Find Duplicate Place Names
question, if you go into wwd> places on that record, and edit and put a leading space in there, does it stay?
FH V.6.2.7 Win 10 64 bit
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: Find Duplicate Place Names
Hi Tom
Just sitting down for dinner - will try and have a look later.
Can you clarify what is meant by www please?
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
- tatewise
- Megastar
- Posts: 27080
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: Find Duplicate Place Names
You'll get used to Ron's abbreviations. wwd = Tools > Work with Data
so wwd > places means Tools > Work with Data > Places
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: Find Duplicate Place Names
Hi Folks
Opened, in wwd, an existing place - quotes are to show start & end
'Adelaide,South Australia, Australia'
changed it to
' Adelaide,South Australia, Australia '
Clicked OK
Opened it again to edit
'Adelaide,South Australia, Australia'
FH has trimmed whitespace front and back
Pretty conclusive I believe
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
- tatewise
- Megastar
- Posts: 27080
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: Find Duplicate Place Names
Dave, sorry for the delay in responding, but here goes...
(1) Plugin Header
The header is missing some necessary FH V7 fields (that are also FH V6 compatible).
Use Edit > Insert Script Header to get all the fields of which @Type: and @Licence: are crucial.
(2) Opening Message
The first message mentions "malformed addresses" rather than places, which could confuse users.
(3) Place Tidy
The tidying is the same as that offered by the =TextPart(...) built-in FH function.
So your two functions could be replaced by a call of that TextPart function:
tidyCheck(strPlacename) => if strPlacename ~= fhCallBuiltInFunction("TextPart",strPlacename,1,0,"TIDY") then
tidyPlacename(strPlacename) => strPlacename = fhCallBuiltInFunction("TextPart",strPlacename,1,0,"TIDY")
(4) Table Entries
Creating numerically indexed table entries does not need counters but can use the table library insert function:
ii=ii+1
tblCorrected[ii] = strPlacename
can be replaced by:
table.insert(tblCorrected,strPlacename)
and #tblCorrected gives the number of table entries which is equivalent to ii
It should be possible to populate the Result Set tables entry by entry in the main Plugin flow rather than copying at the end.
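For comparison, a short plain-Lua sketch of the two ways of filling a numerically indexed table. The sample place names and the table names tblCounted and tblInserted are made up for illustration.

```lua
local tblSource = { "Bristol, England", "Florida, USA", "New Jersey, USA" }

-- Counter approach: the script keeps its own row number
local tblCounted = {}
local ii = 0
for _, strPlacename in ipairs(tblSource) do
  ii = ii + 1
  tblCounted[ii] = strPlacename
end

-- table.insert approach: Lua appends at the next free integer key
local tblInserted = {}
for _, strPlacename in ipairs(tblSource) do
  table.insert(tblInserted, strPlacename)
end

print(ii, #tblCounted, #tblInserted)  -- 3  3  3
```

Both tables end up with the same keys 1 to 3, so # works on either of them.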
(5) Plugin Purpose
Your plugin description mentions malformed place names and the definition includes leading and multiple commas.
However, it is generally advised that users employ a fixed number of place column parts each with an assigned role.
i.e. Town, County, State, Country
If some parts are unknown then leading and multiple commas are mandatory, not malformed.
e.g.
, , , China
Newton, , , USA
The tidy format is primarily intended for Diagrams and Reports to display such advised Place name formats tidily.
That tidy format is unlikely to be actually applied to Place record Place names.
Its attraction is that it collapses Place names into a consistent form to detect duplicates.
e.g. The following three Place names all tidy to Newton, USA
Newton ,, ,USA
Newton,,,USA
, Newton,,USA
That highlights the duplication and if those three Place records need to be in any standard format they need merging.
The most likely standard format would be:
Newton, , , USA
which is different from all those three records and is not the tidy format.
The objective I initially proposed was to list such duplicate Place records that would need merging if a standard Place name format were applied. The standard Place name format would probably not be the tidy format.
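To illustrate how tidying collapses such names, here is a sketch in plain Lua. The function tidyPlace is a hypothetical approximation of the documented tidy behaviour (split on commas, trim each part, drop empty parts), not the built-in TextPart function, and findDuplicateGroups is likewise just an illustrative name.

```lua
-- Hypothetical approximation of the tidy format: trimmed, non-empty parts joined by ", "
local function tidyPlace(strPlace)
  local tblParts = {}
  for strPart in strPlace:gmatch("[^,]+") do
    local strTrim = strPart:gsub("%s+", " "):gsub("^ ", ""):gsub(" $", "")
    if strTrim ~= "" then
      table.insert(tblParts, strTrim)
    end
  end
  return table.concat(tblParts, ", ")
end

-- Groups original names by tidied form and keeps only groups with two or more members
local function findDuplicateGroups(tblNames)
  local dicGroups = {}
  for _, strName in ipairs(tblNames) do
    local strTidy = tidyPlace(strName)
    dicGroups[strTidy] = dicGroups[strTidy] or {}
    table.insert(dicGroups[strTidy], strName)
  end
  for strTidy, tblGroup in pairs(dicGroups) do
    if #tblGroup < 2 then
      dicGroups[strTidy] = nil   -- clearing an existing key during traversal is allowed
    end
  end
  return dicGroups
end

local dicGroups = findDuplicateGroups({ "Newton ,, ,USA", "Newton,,,USA", ", Newton,,USA", ", , , China" })
for strTidy, tblGroup in pairs(dicGroups) do
  print(strTidy, #tblGroup)
end
```

The tidied name is used only as a comparison key here; the original names are kept so that whichever standard format is chosen can still be applied when merging.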
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
-
quarlton
- Famous
- Posts: 150
- Joined: 26 Feb 2004 13:07
- Family Historian: V7
- Location: Lincolnshire
- Contact:
Re: Find Duplicate Place Names
Hi Mike
Many thanks for checking that for me. No problem in waiting - I'm not going anywhere at present
(1) Plugin Header - Fixed
(2) Opening Message - Fixed
(3) Place Tidy
I'm having a little difficulty with this one - hence writing my own routine.
If we look at the following example, it deals with the leading comma but deals with neither the space before the second comma nor the multiple spaces between Potter and Heigham.
Code:
strPlacename = ",Potter  Heigham ,England"
strResult = fhCallBuiltInFunction("TextPart",strPlacename,1,0,"TIDY")
fhMessageBox(strResult)
(4) Table Entries
Thanks Mike, I will replace the code here.
(5) Plugin Purpose
I appreciate that some people use commas where parts are missing, but I'm afraid that I'm not one of them.
My objective was to find malformed names (based on my definition) so that people could be aware of them and change them if they so wished.
The addition of place names that would clash if the user tried to correct them was to make sure that they were aware that they would need to merge them.
There will always be a difference of opinion amongst users as to whether or not to use commas to separate missing parts. I can't see a workable solution for that, as there is no way of telling whether the additional commas have been inserted for that purpose or are an accidental typo (my original premise).
Perhaps I should rewrite the welcome screen to better clarify the aim of the plugin?
Many thanks
Dave
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
- tatewise
Re: Find Duplicate Place Names
(3) Place Tidy
I had not spotted that, which looks like a longstanding bug.
The Help for TextPart says: 'For example, ",one,, ,,three ,four,,," would become "one, three, four" after being tidied.'
Notice that in that example the space after 'three ' has been correctly pruned, but the actual function does not do so.
I will report that as a fault to CP.
(5) Plugin Purpose
This grew out of the Place name reports raised by the Export Gedcom File plugin, which only identify records with conflicting names when tidied. So in my mind, that was the objective of this Plugin, but your objective is a bit different.
As you say, it is important to make the Plugin purpose clear.
As a matter of interest, have you tried the three Newton,,USA Place names in your Plugin?
It does not identify them as conflicting and so does not suggest a merge.
If there were a great many more places listed, then unless they are sorted by Suggested Correction the user may not spot the conflict until attempting to apply that correction.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- tatewise
Re: Find Duplicate Place Names
(3) Place Tidy
The temporary fix for the TextPart bug is fhCallBuiltInFunction("TextPart",strText,1,0,"TIDY"):gsub(" +([ ,])","%1")
I've not heard back from CP yet.
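For anyone wanting to see what the gsub part of that fix does on its own, here is a minimal sketch in plain Lua. The helper name pruneSpaces and the sample input are assumptions for illustration only; the input merely stands in for whatever TextPart TIDY currently returns.

```lua
-- Hypothetical helper: applies only the gsub clean-up step quoted above.
function pruneSpaces(strText)
  -- " +([ ,])" matches a run of spaces followed by a space or a comma,
  -- and "%1" keeps only that final character.
  -- The outer parentheses discard gsub's second return value (the match count).
  return (strText:gsub(" +([ ,])", "%1"))
end

-- Assumed stand-in for the text returned by TextPart with TIDY
local strTidied = "Potter  Heigham , England"
print(pruneSpaces(strTidied))  --> Potter Heigham, England
```

Note that gsub returns two values, so wrapping the call in parentheses may be worthwhile if the result is passed straight into another function.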
(6) Initialise
It is advisable to use the following function at the beginning of the Plugin.
It prevents it from being run in an earlier FH version and ensures outstanding changes are saved before running.
fhInitialise(6,2,0,"save_recommended")
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
-
quarlton
Re: Find Duplicate Place Names
Hi Mike
Thanks for the modified TIDY code, it works as promised.
Looking at the Newton scenario it would seem that my approach does work.
I created 4 Places:
Newton ,, ,USA
Newton,,,USA
, Newton,,USA
Newton, USA
The first 3 report as being 'Malformed' and tidy to Newton, USA
They also show as clashing with the existing Newton, USA
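As a rough illustration of the 'Malformed' test, the criteria from the first post could be encoded as Lua patterns along these lines. This is only a sketch: the table tblCriteria and the function findFaults are hypothetical names, not the plugin's actual code.

```lua
-- Assumed encoding of the criteria from the start of the thread as Lua patterns.
local tblCriteria = {
  { "  ", "multiple adjacent spaces" },
  { ",,", "multiple adjacent commas" },
  { "^ ", "leading space" },
  { "^,", "leading comma" },
  { " $", "trailing space" },
  { ",$", "trailing comma" },
  { " ,", "space preceding a comma" },
}

-- Hypothetical helper: returns the list of criteria that a Place name matches.
function findFaults(strPlace)
  local tblFaults = {}
  for _, tblRule in ipairs(tblCriteria) do
    if strPlace:find(tblRule[1]) then
      tblFaults[#tblFaults + 1] = tblRule[2]
    end
  end
  return tblFaults
end

for _, strPlace in ipairs({ "Newton ,, ,USA", "Newton,,,USA", ", Newton,,USA", "Newton, USA" }) do
  print("[" .. strPlace .. "] " .. table.concat(findFaults(strPlace), "; "))
end
```

With these patterns the first three names each match at least one criterion and the last one matches none.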
This is certainly giving my brain a good workout, and I thank you for your assistance.
I've attached an updated Gedcom with the Newton records in it.
The original inspiration was the routine in your Export Gedcom File as it highlighted a number of errors in my friend's Gedcom and I thought that it would be a good idea to try and replicate this as a stand-alone plugin.
I suppose that the fundamental issue left to resolve is regarding how places should be tidied.
If I were to go down the 'Standard' format route, places would be formatted with separating commas to identify missing data:
Grantham, Lincolnshire, England
Grantham,, England
Grantham,,, England
Barrowby,,, England
This aligns all places in the same columns.
As it happens, this is correct for the first 3 entries.
However, the fourth entry intentionally has 3 commas because the correct address is
Barrowby, Grantham, Lincolnshire, England
I wonder if I am guilty of over-thinking it?
Why would the user put in commas for Grantham and Lincolnshire unless they knew they should be there? And if they knew that, they would surely put the names in. Oh, my head hurts.
BTW, I seem to have read something recently that there is a way to quickly restore the FH Sample Project, but can't remember where I saw it (or if I've imagined it)
Many thanks
Attachment: Test FH Sample Project.ged (86 KiB) - Modified to include Newton, USA test places
Dave Simpson ~ Boulton, Braham, Carney, Simpson and Jacobs
- tatewise
Re: Find Duplicate Place Names
The test I suggested for the three Newton, , USA names deliberately did NOT include the tidy Newton, USA
There is no guarantee that Newton, USA will exist as a Place record, so please review your Plugin without it.
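One possible way to make the check independent of the tidy record is to group every Place name by its tidied form and report any group with more than one member. The sketch below is plain Lua under stated assumptions: tidyPlace is a hypothetical stand-in for whichever tidy routine the Plugin uses, findConflicts is an invented name, and the sample names are the three from the earlier post plus an unrelated one.

```lua
-- Hypothetical tidy routine: trim each comma-separated part and drop empty ones.
function tidyPlace(strPlace)
  local tblParts = {}
  for strPart in (strPlace .. ","):gmatch("([^,]*),") do
    strPart = strPart:gsub("^%s+", ""):gsub("%s+$", ""):gsub("%s+", " ")
    if strPart ~= "" then tblParts[#tblParts + 1] = strPart end
  end
  return table.concat(tblParts, ", ")
end

-- Group Place names by tidied form. Any group with more than one member
-- would need merging, whether or not the tidy form itself exists as a record.
function findConflicts(tblPlaces)
  local tblGroups = {}
  for _, strPlace in ipairs(tblPlaces) do
    local strKey = tidyPlace(strPlace)
    tblGroups[strKey] = tblGroups[strKey] or {}
    table.insert(tblGroups[strKey], strPlace)
  end
  local tblConflicts = {}
  for strKey, tblNames in pairs(tblGroups) do
    if #tblNames > 1 then tblConflicts[strKey] = tblNames end
  end
  return tblConflicts
end

-- Note: the tidy "Newton, USA" is deliberately absent from the list.
local tblFound = findConflicts({ "Newton ,, ,USA", "Newton,,,USA", ", Newton,,USA", "Salford, Lancashire" })
for strKey, tblNames in pairs(tblFound) do
  print(strKey .. " <= " .. table.concat(tblNames, " | "))
end
```

Because the grouping key is the tidied text, the same approach should work for any standard format, provided tidyPlace is swapped for a routine that produces that format.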
Assuming the standard format is Town, County, Country then...
1) Grantham, Lincolnshire, England is valid with all three known
2) Grantham,, England is valid but with the County unknown
3) Grantham,,, England is invalid with too many comma parts and Grantham is in the wrong column
4) Barrowby,,, England is invalid with too many comma parts
5) Lincolnshire,, England has a valid number of commas but Lincolnshire is in the wrong column
An automatic process for tidying to that format has major problems.
It can easily report when the number of commas is wrong.
But it cannot report when a place part is in the wrong column without having a worldwide gazetteer of place names.
If it simply removes the extra comma in the above cases 3) & 4) then 3) would be correct but 4) wrong.
Also, without knowing all the Counties, 5) would remain wrong.
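To illustrate the limitation, here is a minimal plain Lua sketch that only counts comma-separated parts against an assumed three-column format of Town, County, Country. The names intColumns, countParts and checkColumns are hypothetical.

```lua
-- Assumed standard format: Town, County, Country (three columns).
local intColumns = 3

-- Hypothetical helper: counts comma-separated parts, including empty ones.
function countParts(strPlace)
  local _, intCommas = strPlace:gsub(",", ",")
  return intCommas + 1
end

function checkColumns(strPlace)
  local intParts = countParts(strPlace)
  if intParts == intColumns then
    -- Right number of columns, but the contents may still be in the wrong column.
    return "OK"
  end
  return "wrong number of parts: " .. intParts
end

for _, strPlace in ipairs({
  "Grantham, Lincolnshire, England",
  "Grantham,, England",
  "Grantham,,, England",
  "Barrowby,,, England",
  "Lincolnshire,, England",
}) do
  print(strPlace .. " -> " .. checkColumns(strPlace))
end
```

Cases 3) and 4) are reported as having four parts, whereas case 5) passes as OK even though Lincolnshire is in the Town column, which is exactly the problem a comma count alone cannot detect.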
It is because of those complications that my concept for the Plugin was to report only conflicting Place records that would need merging if a standard format were applied. The actual standard format is irrelevant to producing such a report.
When entering Place names in Facts it is all too easy to enter the wrong number of commas in the wrong position.
FH will not complain. It is only by reviewing via Tools > Work with Data > Places or with Plugins that mistakes are found.
Use File > Project Window > More Tasks... > Samples > Reset Sample Project to reset FH Sample Project.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry