Find Duplicate Individuals (All Relations)

Posted: 27 Nov 2012 20:48
by tatewise
Yes, sorry Bill that is a bug.
Somehow the line of code detecting the Stop in the Loading phase went AWOL.

The shorter Loading phase, using the new recursive algorithm, processes Record Ids in what appears to be a random order.
An alternative would be to simply display a count of how many records have been Loaded, which will always increase sequentially.

The longer Scoring phase still ascends sequentially through Record Ids from 1, but in an effort to reduce Progress Bar overhead the display is only updated about once per second or once per percent, so some Record Ids may be skipped in the display.
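
A minimal sketch of the throttling described above, written in Python rather than the plugin's own Lua and not taken from its source. It assumes the display refreshes when the whole-number percentage changes or when a second has elapsed, whichever comes first; the class name `ThrottledProgress`, the `step` method and the `stop_requested` flag are hypothetical. It also checks for a Stop request on every record, the kind of check that had gone missing from the Loading phase.

```python
import time


class ThrottledProgress:
    """Refresh a progress display only when the percentage changes or time elapses.

    Hypothetical illustration; not the plugin's actual code.
    """

    def __init__(self, total, min_interval=1.0, clock=time.monotonic, emit=print):
        self.total = total                # number of records to process
        self.min_interval = min_interval  # seconds between forced refreshes
        self.clock = clock                # injectable so the class can be tested
        self.emit = emit                  # called with each message that is shown
        self.last_time = clock()
        self.last_percent = None          # nothing shown yet
        self.stop_requested = False       # set by a Stop button handler

    def step(self, record_id):
        """Record progress for one record; return False if the loop should stop."""
        percent = record_id * 100 // self.total
        now = self.clock()
        percent_changed = percent != self.last_percent
        interval_elapsed = now - self.last_time >= self.min_interval
        if percent_changed or interval_elapsed:
            self.emit(f"Scoring Record Id {record_id} ({percent}%)")
            self.last_percent = percent
            self.last_time = now
        # The Stop check runs on every record, even when the display is not refreshed.
        return not self.stop_requested
```

A processing loop would call `step(record_id)` once per record and break out when it returns `False`. Most calls refresh nothing, which is why some Record Ids never appear on screen.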

I realise on your system the run-time has increased about 10% with respect to Version 3.1, whereas on mine it has reduced about 10%.

Posted: 27 Nov 2012 20:53
by gerrynuk
tatewise said:
Please can you all try Find Duplicate Individuals V3.1f via the WiP page, which has revised the Progress Bar messages again, plus a few other minor changes, as mentioned above.

When convenient, can you please supply the information requested in my previous posting above.

Loading: approx. 2 min 10 sec
Scoring: 1 min 10 sec

Individuals: 7314
Families: 2077

Is scoring the same as checking? I didn't see any checking message unless it flashed up very quickly.

Posted: 27 Nov 2012 21:12
by tatewise
Thanks Gerry, yes, I have changed the earlier request to Scoring Records.

Posted: 27 Nov 2012 21:23
by johnmorrisoniom
Hi Mike.

Version 3.1f

Individuals 32864
Families 8991

Loading time 13m 15s. Record ID and time update at random intervals. Record IDs non-sequential.

Scoring time 23m 05s. Record ID and time update at fairly regular intervals. Record IDs sequential.

Total run time 36m 20s

Time to display result set: less than 1 min

{edit}

I tried pre-sorting records by ID, but that had no effect on the randomness of the Record IDs during loading. (I would guess they are loaded sequentially in the order they are accessed in the GEDCOM file.)

However, pressing 'stop finding duplicates' had no effect at all. Even clicking the top-right X on the window does not stop the plugin; it just closes the progress window, which then can't be re-opened.

Posted: 27 Nov 2012 22:20
by BillH
Mike,

As for the Record Id display, it is completely up to you. I can live with it no matter how it is done. The plugin is so fast on my system with my database that it really doesn't matter. Whatever speeds things up for those who have longer processing times is the way to go.

Thanks!

Bill

Posted: 29 Nov 2012 15:15
by tatewise
Please can you all try Find Duplicate Individuals V3.1g via the WiP page.

This has reverted to the original sequential processing of Record Id that many prefer, especially as the separate Loading and Scoring phased approach did not result in the run-time reductions I had expected.

Could you please provide your usual feedback of project size and run-time.

Gerry ~ I am fascinated that your Project is the only one that took longer in the Loading phase than the Scoring phase, and is one of the smaller Projects.
The Loading phase placed demands on the FH to Plugin interface to extract data from the Project.
Whereas, the Scoring phase was almost exclusively internal Plugin processing with little FH interaction until the Result Set display.
Is there anything unusual about your Project or PC set-up that might account for this anomaly?

Posted: 29 Nov 2012 17:16
by BillH
Mike,

On my database of 10,353 individuals:

Version 3.1: 98 seconds
Version 3.1g: 102 seconds

Thanks for the change to not have the progress window take focus. I was able to type this post while the plugin was running.

Thanks!

Bill

Posted: 29 Nov 2012 20:59
by tatewise
Bill ~ In your posting Re: Find Duplicate Individuals (All Relations), posted on 26/11/12, you said V3.1 took 80 s.
Do the timings tend to vary depending on what else you are doing?

Posted: 29 Nov 2012 21:07
by BillH
Mike,

I was surprised by this increase in duration as well. Usually, I find that the plugin is pretty consistent on its duration. I don't find that the timing varies depending on what I'm running.

Just now I ran them again and version 3.1 took 85 seconds and version 3.1g took 87 seconds.

Not sure what was going on this morning. I wasn't running any additional programs I was aware of. Maybe my antivirus was doing a background scan or something?

Bill

Posted: 30 Nov 2012 13:29
by gerrynuk
tatewise said:
Gerry ~ I am fascinated that your Project is the only one that took longer in the Loading phase than the Scoring phase, and is one of the smaller Projects.
The Loading phase placed demands on the FH to Plugin interface to extract data from the Project.
Whereas, the Scoring phase was almost exclusively internal Plugin processing with little FH interaction until the Result Set display.
Is there anything unusual about your Project or PC set-up that might account for this anomaly?

Mike, I am running FH in Windows XP under Parallels on my iMac. Not sure why this should make any difference to the relative speeds for each section. Would the number of Sources attached to an individual make any difference? There are 2970 sources and nearly 3200 multimedia items.

Posted: 30 Nov 2012 16:11
by tatewise
It is not so much the number of Source Records as the number of Citations that might be the cause.

What might be useful is to run the Show Project Statistics Plugin and look at the Records tab that shows the number of Source Records and the number of (Citation) Links to them.

Posted: 30 Nov 2012 16:38
by BillH
I'm not sure if it would help for comparison reasons, but I ran the Show Project Statistics plugin and my database has 1,952 sources with 49,661 links.

Bill

Posted: 30 Nov 2012 17:30
by gerrynuk
tatewise said:
It is not so much the number of Source Records as the number of Citations that might be the cause.

What might be useful is to run the Show Project Statistics Plugin and look at the Records tab that shows the number of Source Records and the number of (Citation) Links to them.

Here are the results:

[Images: Show Project Statistics results]

Posted: 01 Dec 2012 16:56
by tatewise
Gerry ~ Nothing in those statistics looks particularly unusual.
There are rather more Flags than I tend to use, but that should not matter much.

Would it be interesting if I created a temporary Plugin designed to report the run-time of the Plugin FH API interface versus purely Plugin Lua Code?
This could be run by various users and we could compare times.

Quite unrelated, I notice from your statistics that you have someone Born aged 18 and someone in a Census aged -2.

Posted: 01 Dec 2012 19:39
by gerrynuk
tatewise said:
Gerry ~ Nothing in those statistics looks particularly unusual.
There are rather more Flags than I tend to use, but that should not matter much.

Would it be interesting if I created a temporary Plugin designed to report the run-time of the Plugin FH API interface versus purely Plugin Lua Code?
This could be run by various users and we could compare times.

Sounds like a good idea!

tatewise said:
Quite unrelated, I notice from your statistics that you have someone Born aged 18 and someone in a Census aged -2.

Yes, I had spotted these - just a question of finding them! Any suggestions?

Posted: 01 Dec 2012 20:20
by johnmorrisoniom
I used a query to find age at birth and excluded those where the age was null. (I had someone born aged 102.)

Posted: 01 Dec 2012 22:11
by gerrynuk
Unfortunately, a simple query isn't showing up the culprits, as they must be buried somewhere in the Facts/Events.

Posted: 02 Dec 2012 16:35
by tatewise
The age statistics are all under Age@ so, as the Help & Advice mentions, they are derived from the AgeAt() function and do NOT exist as Fact Date fields.

So make a Custom copy of the Standard 'All Facts' Query, and add a Column for Age At using =AgeAt(FactOwner(%FACT%,1,MALES_FIRST),%FACT.DATE%).

Then in the Result Set click on the Age At Column to sort into ascending order, or Alt key click to sort into descending order.
Click on the Fact Column to sort the Facts by Name, i.e. bring all the Birth Facts together.
Alternatively, use the Columns tab Sort option at the bottom.

Posted: 02 Dec 2012 17:20
by tatewise
Would any of you like to try the Assess Plugins V1.0 via the WiP page?

This Plugin has two phases.
The first phase loops through each Individual Record and performs a variety of FH API interface functions (all read only) but does very little else.
The second phase loops through a variety of local Lua script statements with no FH API calls at all.
The run-time of the two phases is reported at the end.
The phases take about 2 secs each for 2000 Individuals and 1,000,000 loops on my Windows 7 Home Premium SP1 64-bit PC with a Pentium Dual-Core 2.6 GHz CPU and 3 GB RAM.
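
The two-phase measurement described above might be sketched roughly as follows. This is a Python illustration and not the Assess Plugins source, which is a Lua plugin. The `api_phase` function is a hypothetical stand-in for the read-only FH API calls (it merely reads a field from each in-memory record), and all function names here are assumptions made for the example.

```python
import time


def time_phase(work, *args):
    """Run one phase and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = work(*args)
    return result, time.perf_counter() - start


def api_phase(records):
    """Hypothetical stand-in for phase 1: one read-only lookup per record."""
    return sum(len(record["name"]) for record in records)


def script_phase(loops):
    """Phase 2: purely local statements with no external calls."""
    total = 0
    for i in range(loops):
        total += i % 7
    return total


def assess(records, loops):
    """Time both phases separately and report each run-time in seconds."""
    _, api_seconds = time_phase(api_phase, records)
    _, script_seconds = time_phase(script_phase, loops)
    return {"api_seconds": api_seconds, "script_seconds": script_seconds}
```

Timing the phases separately is what allows the interface cost to be compared with the pure script cost on different PCs.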

Posted: 02 Dec 2012 18:41
by BillH
Mike,

Here are my results for the Assess Plugins plugin:

[Image: Assess Plugins results]

Bill

Posted: 02 Dec 2012 19:46
by gerrynuk
tatewise said:
The age statistics are all under Age@ so as the Help & Advice mentions they are derived from the AgeAt() function and do NOT exist as Fact Date fields.
.....

Thanks, Mike. That showed up the problems nicely.

Posted: 02 Dec 2012 19:51
by gerrynuk
Mike,

Here are the results for my setup (Win XP running under Parallels on an iMac):

[Image: Assess Plugins results]

Posted: 03 Dec 2012 09:25
by johnmorrisoniom
Hi Mike.
An odd result on my XP SP3 2.4 GHz Quad Core.

[Image: Assess Plugins results]

The data file has 33018 Individuals and 9040 families

Posted: 03 Dec 2012 09:57
by johnmorrisoniom
You can't post an image on an edit. So this is an edit to my previous post.

I have found the problem.
Because the plugin window keeps grabbing the focus, I had managed to cancel the run unintentionally.

The correct result should have been:

[Image: corrected Assess Plugins results]

Posted: 03 Dec 2012 12:10
by tatewise
Gerry & John ~ Your assessments are remarkably similar, given that the results are rounded down to the nearest second.
The ratio between FH API and Lua Script is approaching 10 to 1 on both PCs.

Whereas Bill's and my results have a ratio much nearer 1 to 1.

John's 2.4 GHz Quad-Core CPU is very similar to my 2.6 GHz Dual-Core CPU, yet the FH API time per 1,000 Individuals is about 10 times mine (& Bill's).

How much RAM do you each have on your PCs?

Or maybe it is a Windows XP versus Windows 7 characteristic?
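
The comparison in this post rests on two simple normalisations: FH API time per 1,000 Individuals, and the ratio of FH API time to Lua script time. A small Python sketch of that arithmetic follows, with hypothetical function names. The only inputs used are the figures stated earlier in the thread for Mike's PC (about 2 secs for each phase, with 2000 Individuals); the other users' timings were posted as images and are not reproduced here.

```python
def seconds_per_thousand(api_seconds, individuals):
    """FH API time normalised to 1,000 Individual records."""
    if individuals <= 0:
        raise ValueError("individuals must be positive")
    return api_seconds * 1000 / individuals


def api_to_script_ratio(api_seconds, script_seconds):
    """How many times longer the FH API phase took than the script phase."""
    if script_seconds <= 0:
        raise ValueError("script_seconds must be positive")
    return api_seconds / script_seconds


# Figures quoted in the thread for Mike's PC: about 2 s per phase, 2000 Individuals.
mike_per_thousand = seconds_per_thousand(2, 2000)  # 1.0 s per 1,000 Individuals
mike_ratio = api_to_script_ratio(2, 2)             # 1.0, i.e. "nearer 1 to 1"
```

Because the plugin rounds its timings down to the nearest second, ratios computed from short runs are only approximate.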