before I write my gedcom...

Posted: 14 Apr 2020 00:07
by Ron Melby
fhSetStringEncoding('UTF-8')

is the first line of my plugin

I intend to trim my address fields, and as I set up my program,
I am using print statements, to see the effect of my trim statement

I enter debug.

What prints in the console area for the UTF-8 strings is odd characters where my Norwegian letters should be.
My understanding is that this is due to the Lua 5.1 print statement and does not affect the actual byte stream I am reading and rewriting. Is my understanding correct, before I trash all my address fields?

Thanks in advance

Re: before I write my gedcom...

Posted: 14 Apr 2020 09:49
by tatewise
Yes, Lua has no knowledge of character encodings at all.
The string library section of the reference manual includes statements such as:
The string library assumes one-byte character encodings.
Note that numerical codes are not necessarily portable across platforms.
The definition of what a lowercase letter is depends on the current locale.

So as far as Lua is concerned, strings are just a byte stream of integers.
How they happen to be displayed depends on the current operating system.

It is only the FH API functions that understand UTF-8 character encoding.
Also the lower right debug pane does show strings correctly, because that is managed by FH.

If you ever need to write Lua script to manipulate strings involving multi-byte UTF-8 characters, then the above limitations must be taken into account. For example, string.len(s) in Lua is the number of bytes, not the number of UTF-8 characters displayed in FH. See FH V6 Unicode (plugins:getting_started#fh_v6_unicode) and associated advice pages.
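
To illustrate the byte-versus-character point, here is a minimal sketch in plain Lua 5.1 with no FH API calls. The function name utf8CharCount is hypothetical (it is not part of Lua or FH), and it assumes the input is valid UTF-8: it counts every byte that is not a continuation byte (0x80 to 0xBF).

```lua
-- Hypothetical helper, not part of Lua or the FH API.
-- Assumes strTxt is valid UTF-8: counts the bytes that are NOT continuation bytes (0x80-0xBF).
function utf8CharCount(strTxt)
    local _, intCount = string.gsub(strTxt, "[^\128-\191]", "")
    return intCount
end

local strName = "Bj\195\184rn"      -- "Bjørn": the ø is stored as the two bytes C3 B8
print(string.len(strName))          --> 6 (bytes)
print(utf8CharCount(strName))       --> 5 (characters)
```

The two numbers differ by one because only the ø occupies more than one byte.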

So avoid performing the Address trimming in Lua script because commas may appear to exist within UTF-8 multi-byte encodings. Only use the FH trimming functions.

Re: before I write my gedcom...

Posted: 14 Apr 2020 11:05
by Ron Melby
It seemed to be what you have explained. One hates to make large changes to one's file without certitude (whether or not one can roll back those changes). My only trim is to ensure no spaces at the front, no double spaces, and no spaces at the end.

return (str:gsub('%s+', ' '):gsub('^%s+', ''):gsub('%s+$', ''))

( , spc, nil, nil respectively)

^^This works everywhere I have used it in reporting, does not bother with anything beyond space chars. Are you thinking this is not going to work correctly? I use local len = UTF8len(str) when I want utf8 actual length, but here I am only concerned with extra spaces.

TEXT PART TIDY will remove my placeholder commas. Not at all what I want.
I do not know of any other trim function in fh.

in places, it seems like I can do the same, and do the _PLAC records as well, ignoring errors, then going into the place link file, and by hand melding the 0 records into any other record to rid myself of them.

I know of no way to get to links like they are on the screens directly with an api. that would be grand.

Re: before I write my gedcom...

Posted: 14 Apr 2020 11:54
by tatewise
Sorry, I assumed that by trim Address you meant tidy redundant commas, spaces, etc, etc.

Did you actually read all the pages of advice that I identified?

I advise you NOT to use the Lua group patterns such as %s that represents all white space characters.
Depending on the current locale, some of the byte values they match may occur within multi-byte UTF-8 characters.
If you really only want to detect space characters, then use them in the patterns.
i.e.
return (str:gsub(' +', ' '):gsub('^ +', ''):gsub(' +$', ''))
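
One way to check the space-only patterns before touching real data is to run them on a test string in plain Lua. The sketch below wraps the line above in a hypothetical function named trimSpaces; the sample address is invented for illustration and is not taken from any real file.

```lua
-- Hypothetical wrapper around the space-only patterns; not an FH API function.
function trimSpaces(str)
    -- collapse runs of spaces, then strip leading and trailing spaces;
    -- the outer parentheses discard the substitution count that gsub also returns
    return (str:gsub(' +', ' '):gsub('^ +', ''):gsub(' +$', ''))
end

-- Invented sample with a placeholder comma; \195\133 is the UTF-8 encoding of Å
local strAddr = "  Storgata 1,  , 6002  \195\133lesund "
print("[" .. trimSpaces(strAddr) .. "]")   --> [Storgata 1, , 6002 Ålesund]
```

Only the space character (0x20) is affected, so placeholder commas and the bytes of the Norwegian letters pass through unchanged. Tabs and other white space are left alone too, which may or may not be what is wanted.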

To trim Place names, just trim each Place record _PLAC.TEXT field.
Do NOT trim the Fact PLAC fields. They will be automatically synchronised with the changed records.
Then you will not need to manually tidy up.

Where did you find the UTF8len(str) function?

Did you see Unicode String Functions (code snippet) (plugins:code_snippets:unicode_string_functions), which offers several UTF-8 string functions?
e.g.

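-- Supply character length of UTF-8 text --
function length(strTxt)
      -- save the conversion loss flag so that this call leaves it unchanged
      local isFlag = fhIsConversionLossFlagSet()
      -- ANSI text uses one byte per character, so string.len then counts characters
      strTxt = fhConvertUTF8toANSI(strTxt or "")
      fhSetConversionLossFlag(isFlag)
      return string.len(strTxt)
end -- function length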

Re: before I write my gedcom...

Posted: 14 Apr 2020 15:30
by Ron Melby
You told me where I could find it one time in the halcyon days of yore, when I was writing output to a text file and my names and addresses wouldn't line up on the output lines, and I was using only string.len. I just gave it a different name when I put it in my standard formatting library.

I remember a smattering of what you tell me, after all.

Re: before I write my gedcom...

Posted: 15 Apr 2020 11:27
by JohnnyCee
tatewise wrote:
14 Apr 2020 09:49
So avoid performing the Address trimming in Lua script because commas may appear to exist within UTF-8 multi-byte encodings. Only use the FH trimming functions.

Mike,

UTF-8 was carefully designed such that the value of any ASCII character (0x00 to 0x7F) would not appear as a byte value within a multi-byte character. I am not disagreeing with your advice to do text manipulation in FH rather than Lua (I know nothing about writing plug-ins); I am just alerting you to this useful characteristic of UTF-8.

John

Re: before I write my gedcom...

Posted: 15 Apr 2020 11:42
by tatewise
You are absolutely right. I don't know why I gave that advice, which is wrong.
It is the Lua pattern matching codes like %s, %a, %l that cause the problems.
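
The UTF-8 property John describes can be checked with a short sketch in plain Lua 5.1. The function name minByte is hypothetical, and the test string is simply the six Norwegian letters written as byte escapes. Every byte of a multi-byte character is 0x80 (128) or above, so a literal comma (44) or space (32) in a pattern cannot match inside one.

```lua
-- Hypothetical helper, not part of Lua or the FH API: smallest byte value in a string.
function minByte(strTxt)
    local intMin = 255
    for i = 1, #strTxt do
        local intByte = string.byte(strTxt, i)
        if intByte < intMin then intMin = intByte end
    end
    return intMin
end

-- æ ø å Æ Ø Å as UTF-8 byte pairs
local strNorsk = "\195\166\195\184\195\165\195\134\195\152\195\133"
print(minByte(strNorsk))             --> 133, so no byte is in the ASCII range
print(string.byte(","), string.byte(" "))   --> 44      32
```

Whether a class such as %s, %a or %l matches any byte above 0x7F depends on the C locale in use, so no output is claimed for those here; literal ASCII characters in a pattern avoid the question.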