Find Duplicate Citations Question
Posted: 04 Feb 2018 10:38
by DavidNewton
I did a stupid thing! I split off part of my tree in order to proofread the citations on the more important parts. I realise I should only work with one tree, so save the criticism. Anyway, after a while I merged the sub-tree back in and ended up with duplicate citations, hundreds of them, and all the corrected citations were second in the lists. I tried the Find Duplicate Citations plugin and it was deleting all the corrected citations. I didn't commit, so no permanent harm done.
OK, so I wanted to delete the citations that were first added. After perusing the plugin I think I have found the necessary correction in this snippet:
Code:
for pcite in childitem(pfact,'SOUR') do
    i = i + 1
    pfSource = fhGetValueAsLink(pcite)
    table.insert(tblDups,{rec=pcite:Clone(),id=fhGetRecordId(pfSource)})
end
table.sort(tblDups,function(a, b) return a.id > b.id end)
Change the table.insert call to:
Code:
table.insert(tblDups,1,{rec=pcite:Clone(),id=fhGetRecordId(pfSource)})
I think that reverses the order of the duplicate citation listings so that the later ones are kept. I have run this and it seems to work, but I have not yet committed the changes.
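For reference, a minimal standalone sketch of what the positional form of table.insert does. It uses plain strings in place of the plugin's citation entries, and the variable names are made up for the illustration rather than taken from the plugin:

```lua
-- Standalone illustration; the strings stand in for citation entries.
local appended, prepended = {}, {}
for _, name in ipairs({ "first", "second", "third" }) do
    table.insert(appended, name)       -- appends to the end of the list
    table.insert(prepended, 1, name)   -- inserts at position 1, shifting the rest up
end
print(table.concat(appended, ", "))   -- first, second, third
print(table.concat(prepended, ", "))  -- third, second, first
```

This only shows the order of the table as it is built; it says nothing about the order after table.sort has run.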
I would appreciate some input as to whether this will produce unwanted side effects.
David
Re: Find Duplicate Citations Question
Posted: 04 Feb 2018 14:14
by tatewise
I believe the Plugin does not guarantee the sorted order of the duplicate Citations.
That is true regardless of which table.insert(tblDups,...) variant is used.
The later table.sort(tblDups,...) function is only sorting using one Source Id > another Source Id.
That means that where one Source Id = another Source Id the order is undefined.
The Lua Reference Manual says:
The sort algorithm is not stable; that is, elements considered equal by the given order may have their relative positions changed by the sort.
So the Plugin assumes that if two Citations reference the same Source record, then either can be deleted.
That you have found any particular preference for 1st or 2nd Citation is just coincidence.
Some experiments that I have run confirm the above.
So you will need to adjust the Plugin further to sort on some other characteristic of the Citation when the two Source Ids are equal.
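To illustrate the point about equal Source Ids, here is a small standalone sketch. The two entries and their label field are made up for the illustration and are not part of the Plugin; only the comparison itself matches the one passed to table.sort in the snippet quoted earlier.

```lua
-- Hypothetical entries: two citations of the same Source record (id 12).
local first  = { id = 12, label = "original citation" }
local second = { id = 12, label = "corrected citation" }

-- The same comparison the plugin snippet passes to table.sort.
local function byId(a, b) return a.id > b.id end

-- Neither entry is ordered before the other, so table.sort treats them
-- as equal and is free to leave them in either order.
print(byId(first, second))  -- false
print(byId(second, first))  -- false
```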
Re: Find Duplicate Citations Question
Posted: 04 Feb 2018 16:56
by DavidNewton
Thanks Mike, I should have known it wasn't that easy. My criterion is simply that I want to keep the last occurrence of each citation, so I will make one more suggestion.
I will confess immediately that I cannot follow every detail of this plugin but I think that the id is used simply as a sort mechanism for this table. I am not that familiar with iterator functions such as childitem but it seems to me that before the table.sort the citations are in the order in which they appear in the ALL tab.
If that is the case then a possible work around would be to assume that there are no more than say 100 citations per fact (certainly the case in my file) and replace the id value by id=fhGetRecordId(pfSource)*100 + i. This will produce a unique id for each citation and include within it a code for the position of the citation in the list. Then the sort function a.id > b.id will produce a unique order.
If I have interpreted the sorting mechanism correctly then the occurrences of the same citation will appear in the table in reverse order of occurrence within the ALL tab. The first one of these, the last in the list, will be kept.
David
Re: Find Duplicate Citations Question
Posted: 04 Feb 2018 18:29
by tatewise
David, you are on the right lines with the sorting, but your idea won't work, because the Source Id must be equal for the duplication to be detected, and by adding the index you have guaranteed no duplicates will be found.
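A quick standalone sketch of that objection, assuming for illustration a fact with three citations of Source record 12. The numbers are made up, and the neighbour-by-neighbour equality check is only a guess at how the Plugin detects duplicates:

```lua
-- Hypothetical data: three citations on one fact, all linking to Source id 12.
local sourceIds = { 12, 12, 12 }

local keys = {}
for i, id in ipairs(sourceIds) do
    keys[i] = id * 100 + i   -- the proposed combined id
end
print(table.concat(keys, ", "))  -- 1201, 1202, 1203

-- Assumed detection: compare each id with the one before it.
local matches = 0
for i = 2, #keys do
    if keys[i] == keys[i - 1] then matches = matches + 1 end
end
print(matches)  -- 0, so no duplicates would be reported
```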
The solution is to add an extra index field to each tblDups entry and, when the ids are equal, sort on the index.
e.g.
table.insert( tblDups, { rec=pcite:Clone(), id=fhGetRecordId(pfSource), index=i } )
end
table.sort( tblDups,function(a, b) if a.id == b.id then return a.index > b.index end return a.id > b.id end)
i.e. insert the statement "if a.id == b.id then return a.index > b.index end" into the comparison function.
Then as you say, the citations will be in reverse index order, but still grouped by Source Id.
BTW: The warning message needs to be corrected too.
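For anyone wanting to try the idea outside Family Historian, here is a standalone sketch. It replaces the Plugin's citation pointers with plain tables; the sample data, the text field and the keep-the-first-of-each-group loop are assumptions made for the illustration rather than the Plugin's actual code. Only the index field and the comparison function come from the lines above.

```lua
-- Hypothetical stand-in for the citations on one fact, in ALL tab order.
local citations = {
    { id = 7,  text = "census (original)"   },
    { id = 12, text = "baptism (original)"  },
    { id = 7,  text = "census (corrected)"  },
    { id = 12, text = "baptism (corrected)" },
}

local tblDups = {}
for i, cite in ipairs(citations) do
    table.insert(tblDups, { rec = cite, id = cite.id, index = i })
end

-- Group by Source id; within a group, the highest index (latest citation) first.
table.sort(tblDups, function(a, b)
    if a.id == b.id then return a.index > b.index end
    return a.id > b.id
end)

-- Assumed duplicate handling: keep the first entry of each id group.
local kept, lastId = {}, nil
for _, entry in ipairs(tblDups) do
    if entry.id ~= lastId then
        table.insert(kept, entry.rec.text)
        lastId = entry.id
    end
end
print(table.concat(kept, "; "))  -- baptism (corrected); census (corrected)
```

Because every index is unique, the comparison never reports two entries as equal, so the result no longer depends on the sort being stable.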
Re: Find Duplicate Citations Question
Posted: 04 Feb 2018 21:48
by DavidNewton
Thank you.
David