* Find Duplicate Citations Question


Post by DavidNewton » 04 Feb 2018 10:38

I did a stupid thing! I split off part of my tree in order to proofread the citations on the more important parts. I realise I should only work with one tree, so save the criticism. Anyway, after a while I merged the sub-tree back in and ended up with duplicate citations, hundreds of them, and all the corrected citations were second in the lists. I tried the Find Duplicate Citations plugin and it was deleting all the corrected citations. I didn't commit, so no permanent harm done.
OK, so I wanted to delete the citations that were added first. After perusing the plugin I think I have found the necessary correction in this snippet:

                for pcite in childitem(pfact,'SOUR') do
                    i = i + 1
                    pfSource = fhGetValueAsLink(pcite)
                    table.insert(tblDups,{rec=pcite:Clone(),id=fhGetRecordId(pfSource)})
                end
                table.sort(tblDups,function(a, b) return a.id > b.id end)
                
Change the table.insert call to:

                    table.insert(tblDups,1,{rec=pcite:Clone(),id=fhGetRecordId(pfSource)})
I think that reverses the order in which the duplicate citations are listed in the table, so that the later ones are kept. I have run this and it seems to work, but I have not yet committed the changes.
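
As a minimal plain-Lua sketch of what that change does to insertion order (no Family Historian API calls; the strings are hypothetical stand-ins for citations), compare appending with inserting at position 1:

```lua
-- Stand-in data only: these strings are hypothetical placeholders for citations.
local appended, prepended = {}, {}

for _, cite in ipairs({ "first", "second", "third" }) do
    table.insert(appended, cite)      -- append: keeps the original order
    table.insert(prepended, 1, cite)  -- insert at position 1: reverses the order
end

print(table.concat(appended, ", "))   -- first, second, third
print(table.concat(prepended, ", "))  -- third, second, first
```

This only shows the order before any sorting; the later table.sort call can still rearrange the entries.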

I would appreciate some input as to whether this will produce unwanted side effects.

David


Post by tatewise » 04 Feb 2018 14:14

I believe the Plugin does not guarantee the sorted order of the duplicate Citations.
That is true regardless of which table.insert(tblDups,...) variant is used.
The later table.sort(tblDups,...) call compares only the Source Ids, i.e. one Source Id > another Source Id.
That means that where one Source Id = another Source Id, the order of those entries is undefined.
The Lua Reference Manual says:
"The sort algorithm is not stable; that is, elements considered equal by the given order may have their relative positions changed by the sort."
So the Plugin assumes that if two Citations reference the same Source record, then either can be deleted.

That you have found any particular preference for 1st or 2nd Citation is just coincidence.

Some experiments that I have run confirm the above.

So you will need to adjust the Plugin further to sort on some other characteristic of the Citation when the two Source Ids are equal.
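
To illustrate why equal Source Ids count as ties, here is a small plain-Lua sketch. The comparator mirrors the one in the snippet; the two table entries and the name byIdDescending are hypothetical, and index is only a label here that the comparator ignores:

```lua
-- Same comparison as the plugin snippet: only the Source Id is examined.
local function byIdDescending(a, b)
    return a.id > b.id
end

-- Hypothetical entries: two citations of the same Source record.
local first  = { id = 7, index = 1 }
local second = { id = 7, index = 2 }

-- Neither entry sorts before the other, so the comparator treats them as equal
-- and table.sort is free to leave them in either relative order.
print(byIdDescending(first, second))  -- false
print(byIdDescending(second, first))  -- false
```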
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry


Post by DavidNewton » 04 Feb 2018 16:56

Thanks Mike, I should have known it wasn't that easy. My criterion is simply that I want to keep the last occurrence of each citation, so I will make one more suggestion.

I will confess immediately that I cannot follow every detail of this plugin but I think that the id is used simply as a sort mechanism for this table. I am not that familiar with iterator functions such as childitem but it seems to me that before the table.sort the citations are in the order in which they appear in the ALL tab.

If that is the case, then a possible workaround would be to assume that there are no more than, say, 100 citations per fact (certainly the case in my file) and replace the id value by id=fhGetRecordId(pfSource)*100 + i. This will produce a unique id for each citation and include within it a code for the position of the citation in the list. Then the sort function a.id > b.id will produce a unique order.

If I have interpreted the sorting mechanism correctly, then the occurrences of the same citation will appear in the table in reverse order of occurrence within the ALL tab. The first one of these, the last in the list, will be kept.

David


Post by tatewise » 04 Feb 2018 18:29

David, you are on the right lines with the sorting, but your idea won't work, because the Source Id must be equal for the duplication to be detected, and by adding the index you have guaranteed no duplicates will be found.
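
A quick plain-Lua sketch of the problem, assuming (as stated above) that duplicates are detected by comparing the id values for equality; the Source Id of 12 is hypothetical:

```lua
-- Hypothetical Source record id shared by two citations on the same fact.
local sourceId = 12

-- The composite key folds the citation position into the id.
local keyFirst  = sourceId * 100 + 1  -- 1st citation in the list
local keySecond = sourceId * 100 + 2  -- 2nd citation in the list

-- The keys now differ even though both citations point at the same Source,
-- so an equality test on id no longer reports them as duplicates.
print(keyFirst == keySecond)  -- false
```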

The solution is to add an extra index field to each tblDups entry, and when the ids are equal, sort on the index instead.
e.g.
                    table.insert( tblDups, { rec=pcite:Clone(), id=fhGetRecordId(pfSource), index=i } )
                end
                table.sort( tblDups, function(a, b)
                    if a.id == b.id then return a.index > b.index end
                    return a.id > b.id
                end )

i.e. insert the statement if a.id == b.id then return a.index > b.index end at the start of the comparison function.

Then as you say, the citations will be in reverse index order, but still grouped by Source Id.
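
The resulting order can be checked outside Family Historian with a plain-Lua sketch. The Source Ids below are hypothetical and the rec field is omitted, since it needs the plugin environment:

```lua
-- Hypothetical Source ids in the order the citations appear on a fact.
local sourceIds = { 5, 9, 5, 9, 5 }

local tblDups = {}
for i, id in ipairs(sourceIds) do
    table.insert(tblDups, { id = id, index = i })
end

-- Sort by Source id descending; break ties by index descending so that the
-- last occurrence of each Source comes first within its group.
table.sort(tblDups, function(a, b)
    if a.id == b.id then return a.index > b.index end
    return a.id > b.id
end)

local order = {}
for _, entry in ipairs(tblDups) do
    table.insert(order, entry.id .. ":" .. entry.index)
end
print(table.concat(order, " "))  -- 9:4 9:2 5:5 5:3 5:1
```

Because no two entries share both id and index, the comparison never reports a tie, so the lack of stability in table.sort no longer affects the result.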

BTW: The warning message needs to be corrected too.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry


Post by DavidNewton » 04 Feb 2018 21:48

Thank you.

David
