cleansing source text

Post by JoopvB » 27 May 2015 12:03

Some websites behave oddly when their text is dragged and dropped into the text source field (it has something to do with IE): they insert a great number of linefeeds. Correcting the text for each source is a lot of work and defeats the ease of use of drag and drop. So my idea was to just do the drag and drop and once in a while use Mike's search and replace plugin to do some housecleaning. But... I've been playing with some LUA strings and can't find the solution.

So the question is: what do I specify in Mike's plugin to replace any number of consecutive newlines with a space?

Post by tatewise » 27 May 2015 13:00

Yes, Joop, that is not very obvious, but here is the method.

Of course you have selected LUA Pattern Mode top right.
In Basic Filters you have only ticked Text From Source.

In the Search box enter three newline characters (i.e. press Enter three times).
Then enter the + sign to match repetitions of the last newline.

In the Replace box enter a single space character.

This Search pattern will NOT match a solitary newline, or a pair of newline characters.
That is wise because single and double line spacing occurs quite often.

There must be 3 or more adjacent newline characters for a match.
These will all be replaced by one space character.

However, if you want to match 2 newlines, or just 1 newline, then reduce the number of newline characters in the Search box.
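
For anyone who wants to try this pattern outside the plugin, here is a minimal sketch in plain Lua. It assumes the plugin's LUA Pattern Mode behaves roughly like the standard `string.gsub`; the helper name `collapse_blank_runs` and the sample text are made up for illustration.

```lua
-- Sketch only: assumes LUA Pattern Mode works like string.gsub.
-- collapse_blank_runs is a hypothetical helper, not part of the plugin.
local function collapse_blank_runs(text)
  -- "\n\n\n+" = two newlines followed by one or more further newlines,
  -- so only runs of 3 or more newlines are matched.
  local cleaned = text:gsub("\n\n\n+", " ")
  return cleaned
end

-- Four newlines collapse to a space; the double and single newline survive.
local sample = "first\n\n\n\nsecond\n\nthird\nfourth"
print(collapse_blank_runs(sample))
```

Removing one `\n` from the pattern lowers the threshold to 2 newlines, as described above.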
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by JoopvB » 27 May 2015 14:03

Hi Mike,

Thanks, this now works fine for consecutive linefeeds. It now appears that sometimes there are 1 or 2 spaces followed by a linefeed (and/or preceded). Is it possible to filter that out too?

Post by tatewise » 27 May 2015 14:45

Yes, try the following.

In the Search box enter:
One newline
One space and a minus - character and a newline
%s+

Space minus will match any number of spaces between the 1st and 2nd newline (including none).

%s+ matches one or more white space characters such as tab, space, newline.
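
Written out with escapes, the Search box contents above correspond to the Lua pattern `"\n -\n%s+"`. The sketch below assumes the plugin applies it the way `string.gsub` would, and that the Replace box still holds a single space; `collapse_spaced_newlines` is a hypothetical name.

```lua
-- Sketch only: assumes LUA Pattern Mode works like string.gsub.
local function collapse_spaced_newlines(text)
  -- "\n"  one newline
  -- " -"  zero or more spaces (shortest match)
  -- "\n"  a second newline
  -- "%s+" one or more further white space characters
  local cleaned = text:gsub("\n -\n%s+", " ")
  return cleaned
end

print(collapse_spaced_newlines("data\n  \n\nnext"))  -- spaces between the newlines
print(collapse_spaced_newlines("data\n\nnext"))      -- plain double newline is left alone
```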

Post by JoopvB » 27 May 2015 18:13

Beautiful Mike, thanks!

It's like magic... would it (on top of this) also be possible to change : followed by newline by : followed by space? Can that be done in the same "run" with the previous newline trick?

Post by tatewise » 27 May 2015 18:45

Are you saying the : and newline may appear anywhere in the multiple spaces & newlines?

Even if so, I am still not sure if it is possible.

Post by JoopvB » 27 May 2015 23:00

It's usually (if not always):
item:
2 spaces newline
space newline
data
space newline
space newline
next item

to be replaced by:
item: data newline
nextitem

An example of the site with this behavior is on link:

http://www.streekarchiefvpr.nl/pages/nl ... miview=ldt

I hope it's possible; the first solution is already pretty close. Anyway, thanks Mike for taking the time to help me out.

Post by tatewise » 28 May 2015 09:11

Hi Joop, this should do it:

In Search: box enter

(:?)%s%s%s+
(.-)%s%s%s+

This has only one newline character in the middle.

In Replace: box enter

%1 %2

This replace text is percent one space percent two newline
(I assume you don't need a space after the data, but if you do then add a space after %2)

The Search works as follows:
(:?) 1st capture is an optional colon
%s%s%s+ matches at least 3 white space characters but unlimited in length
newline matches a newline character
(.-) 2nd capture is the shortest data without multiple white space
%s%s%s+ matches at least 3 white space characters but unlimited in length

The Replace works as follows:
%1 inserts the 1st capture of colon or nothing
space
%2 inserts the 2nd capture of data
newline

The whole search & replace will repeat as many times as possible within one text field.

You can adjust the minimum number of white space characters to match before and after data by adding or removing %s magic codes.
To specifically match space or newline instead of all white space characters you could substitute %s with
[
]

i.e. square-bracket space newline square-bracket that matches any character within the brackets.
If you wanted to be more specific about what matches then you could use literal space or newline characters instead of %s.
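
As a rough check of the Search and Replace pair above, here is a sketch in plain Lua. It assumes the plugin behaves like `string.gsub`; the helper name `join_label_and_data` is hypothetical, and the sample text simply follows the "item / data / next item" layout described earlier in the thread.

```lua
-- Sketch only: assumes LUA Pattern Mode works like string.gsub.
local function join_label_and_data(text)
  -- (:?)      1st capture: optional colon
  -- %s%s%s+   at least 3 white space characters
  -- \n        a literal newline
  -- (.-)      2nd capture: shortest run of data
  -- %s%s%s+   at least 3 white space characters
  local cleaned = text:gsub("(:?)%s%s%s+\n(.-)%s%s%s+", "%1 %2\n")
  return cleaned
end

-- "item:", 2 spaces + newline, space + newline, "data", space + newline twice, "next item"
local sample = "item:  \n \ndata \n \nnext item"
print(join_label_and_data(sample))
```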

Post by JoopvB » 28 May 2015 14:11

Hi Mike, it's fantastic how powerful these tools are. Your solution is close to perfect, but there is one small snag that I can't explain. I've attached a text file to show you what goes wrong.

Most of the lines go well but starting from (see text file) Naam: Job van Beek, there is no newline after that and it goes on with Vader: and then a newline.
Obviously the Vader: should be on its own line. Further on it also happens a few times.

I understand your explanation of the codes, but can't explain why this is happening and hence, have no idea how to cope with it.

!!! I planned to attach a text file, but I get an error:

Sorry, the board attachment quota has been reached.

No idea why. The file size is 1KB?

Post by tatewise » 28 May 2015 14:32

Try attaching a text file now, but I really need the before and after.

One trick is to change the Replace box to

<%1> {%2}

Then you can see what is matching the two ( ) captures.
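
A small sketch of this diagnostic trick in plain Lua, assuming the plugin works like `string.gsub` and that the Replace box ends with a newline as before. The helper name `show_captures` and the sample text are invented for illustration: the second label has its data on the same line, so it does not follow the "label, white space, newline, data" layout.

```lua
-- Sketch only: assumes LUA Pattern Mode works like string.gsub.
local function show_captures(text)
  -- Same Search pattern as before, but the replacement wraps each capture
  -- in visible markers: <...> for capture 1 and {...} for capture 2.
  local marked = text:gsub("(:?)%s%s%s+\n(.-)%s%s%s+", "<%1> {%2}\n")
  return marked
end

local sample = "Plaats:  \n \nHilversum \n \nNaam: Job \n \nVader: \n \nGerrit"
print(show_captures(sample))
```

An empty `<>` in the output marks a place where the optional colon matched nothing, which means the white space after a data value was taken for the gap after a label.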

Post by JoopvB » 28 May 2015 14:52

Text file with before and after.

I'm now going to try your last suggestion.
Attachment: FH trial text.txt (text copied from source text field = before)

Post by JoopvB » 28 May 2015 14:57

Hi Mike, I ran the test to find out what was captured. The result is:


Aktenummer<:> {8}
Geboorteplaats<:> {Nieuw Helvoet}
Geboortedatum<:> {06-02-1876}
Geborene<:> {Job van Beek}
Naam: Job van Beek<> {Vader:}
Gerrit van Beek<> {Naam: Gerrit van Beek}
Beroep: timmerman
Leeftijd: 31<> {Moeder:}
Margrieta Boelhouwer<> {Naam: Margrieta Boelhouwer}
Opmerkingen<:> {tijd 13.30 uur, akte 8}
Toegangsnummer<:> {092 Gemeente Nieuw-Helvoet (1811-1952)}
Inventarisnummer<:> {804}
laatste wijziging 02-12-2008

So the <> seems to be the problem, but no idea why it matches.

Post by tatewise » 28 May 2015 15:29

It is because lines like the following do NOT follow your rules:

Naam: Job van Beek

Naam: Gerrit van Beek
Beroep: timmerman
Leeftijd: 31

Not enough white space and no newline between colon and data.
So I will have to look at the pattern again.

Post by tatewise » 28 May 2015 18:00

This Search pattern does a good job but every label must end with a colon.

(:)%s%s+([^%s].-)%s%s%s+

Explanation:
(:) captures the colon
%s%s+ matches at least 2 white-space characters
([^%s].-) captures the data that must start with a non-white-space character
%s%s%s+ matches at least 3 white-space characters

Note:
%s matches one white-space character such as tab, space, newline
+ means match longest repeat of preceding character, but at least one of them
[^%s] matches any non-white-space character
. matches any one character
- means match shortest repeat of preceding character, and may be none

This works because there is always a colon and at least 2 white space characters after the label, and always at least 3 white space characters after the data, even when the label & data are all on one line.

Unfortunately, it does not remove the trailing white space after the last line, so either clean that separately, or try not to drag & drop it initially.

Don't forget to use the new Presets in Search and Replace to save your settings for future use, and give those Presets memorable names.
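
The revised Search pattern can be tried in plain Lua as sketched below. This assumes the plugin behaves like `string.gsub` and that the Replace box is still `%1 %2` followed by a newline, which this post does not restate. The helper name `tidy_labelled_text` and the sample text are hypothetical.

```lua
-- Sketch only: assumes LUA Pattern Mode works like string.gsub and that the
-- Replace box is unchanged from the earlier post ("%1 %2" plus a newline).
local function tidy_labelled_text(text)
  -- (:)        capture the colon that ends the label
  -- %s%s+      at least 2 white space characters
  -- ([^%s].-)  data, starting with a non-white-space character
  -- %s%s%s+    at least 3 white space characters
  local cleaned = text:gsub("(:)%s%s+([^%s].-)%s%s%s+", "%1 %2\n")
  return cleaned
end

-- First item has label and data on separate lines, second has both on one line.
local sample = "Aktenummer:  \n \n8 \n \nBeroep:  timmerman \n \n"
print(tidy_labelled_text(sample))
```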

Post by JoopvB » 28 May 2015 19:27

Hi Mike, thanks very much! It works perfectly; amazing so few codes, such great results.

And, I even understand your explanation of how it works (which does not mean I could make one up myself :)).

If only IE would now behave a bit differently for the other website (www), then that major part of copying from the external browser could also be got rid of. I have been trying to find some registry setting to change the behavior (anything but concatenating everything together would be fixable by search and replace, I think), but to no avail.

Anyway, thanks again for all your help on this bit of cleansing!

Post by JoopvB » 01 Feb 2019 17:26

Hi Mike,

It has been some time since my last post (been busy elsewhere), but now I've run into a problem with copying source text information from the Dutch genealogy website into FH (they have changed the format of the information on the website).

In some sense the format is now simple and straightforward. Every information item is copied into FH on two lines: the first is the label of the item, the second is the value of the item. The cleanup would be to get this on one line (label, space, value). My hope is that you know how to use your plugin combined with some LUA magic to make this work.

Thanks in advance, Joop

P.S. Nice to be back on the forum which, as I see, is still very much alive and kicking. :)

Post by tatewise » 01 Feb 2019 18:27

It will need something more unique than just pairs of lines.
Otherwise, any general Search pattern that just looks for text on two successive lines may join every pair of lines in every Text From Source field, including all the older existing ones.
Unless, that is, you can select only the Source records that have had that new text added.

Could you post some examples of what the text looks like?

Can you be sure that the last line ends with a newline character?

Post by JoopvB » 01 Feb 2019 20:22

An example is:

Bruidegom
Pieter van den Broek
Beroep
voermansknecht
Geboorteplaats
Hilversum
Leeftijd
35
Bruid
Johanna Margaretha van den Ouden
Beroep
dienstbode
Geboorteplaats
Amsterdam
Leeftijd
33
Vader van de bruidegom
Albertus van den Broek
Moeder van de bruidegom
Margaretha Houtzager
Vader van de bruid
Leendert van den Ouden
Moeder van de bruid
Lukje Komst
Gebeurtenis
Huwelijk
Datum
06-07-1872
Gebeurtenisplaats
Hilversum
Documenttype
BS Huwelijk
Erfgoedinstelling
Noord-Hollands Archief
Plaats instelling
Haarlem
Collectiegebied
Noord-Holland
Aktenummer
28
Registratiedatum
06-07-1872
Akteplaats
Hilversum
Aktesoort
H

It doesn't start with an empty line and there is no CR-LF after the last line.

The expected result:

Bruidegom Pieter van den Broeck
Beroep conducteur
Geboorteplaats Hilversum
Leeftijd 27
Bruid Willemina Rijsenbach
Geboorteplaats De Bilt
Leeftijd 26
Vader van de bruidegom Albertus van den Broeck
Moeder van de bruidegom Elisabeth Margaretha Houtzager
Vader van de bruid Willem Rijsenbach
Beroep Wagenmaker
Moeder van de bruid Jesina Verschoof
Gebeurtenis Huwelijk
Datum 25-03-1865
Gebeurtenisplaats Hilversum
Documenttype BS Huwelijk
Erfgoedinstelling Noord-Hollands Archief
Plaats instelling Haarlem
Collectiegebied Noord-Holland
Aktenummer 5
Registratiedatum 25-03-1865
Akteplaats Hilversum
Aktesoort H

I tried with Regex in Notepad++, using search string "(\r\n.*)\r\n" and replace "\1 " after inserting a CR-LF up front, and it did the job. But using LUA and the plugin is of course preferable.

Post by tatewise » 01 Feb 2019 21:05

It will make things much easier if you can ensure there is a blank newline after the last text line.

Then use the Search and Replace Plugin as shown below (Remember to make a Backup beforehand).
The Search pattern must end with one newline:
(.-)
(.-)


The Replace pattern must also end with one newline:
%1 %2


Be sure to set the Search Scope to Source Records (SOUR) and Select Records to be converted.
Be sure to set the Basic Filters to tick Text From Source fields only.
[Screenshot: CleanSource.png]
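
The pair-joining pattern can be sketched in plain Lua as follows, again assuming the plugin behaves like `string.gsub`. The helper name `join_line_pairs` is hypothetical; the sample lines are taken from the example posted earlier in the thread.

```lua
-- Sketch only: assumes LUA Pattern Mode works like string.gsub.
local function join_line_pairs(text)
  -- (.-)\n  shortest text up to a newline (the label)
  -- (.-)\n  shortest text up to the next newline (the value)
  local cleaned = text:gsub("(.-)\n(.-)\n", "%1 %2\n")
  return cleaned
end

-- With a newline after the last line, every pair is joined.
print(join_line_pairs("Bruidegom\nPieter van den Broek\nBeroep\nvoermansknecht\n"))

-- Without it, the final pair is left as two lines.
print(join_line_pairs("Leeftijd\n35\nAktesoort\nH"))
```

The second call shows why the newline after the last text line matters: the pattern needs a newline after each value, so a final value with nothing after it never matches.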

Post by JoopvB » 01 Feb 2019 21:32

Mike, works like a charm.

Thanks and enjoy the weekend!
