
Determining the encoding of a string

Posted: 31 Jan 2022 18:28
by ColeValleyGirl
Does anyone have a code snippet to determine the encoding of a string ('ANSI', UTF-8, UTF-16LE)? The string might of course be read from a file...

Background: I'm still working on the KB article on handling non-ANSI characters; I have a solution for dealing with UTF-16 strings (already added to the KB as a snippet) but need to offer a solution for determining what the encoding of a file is.

No worries if nobody has one - I can knock one up, but don't like reinventing wheels - life is too short.

Re: Determining the encoding of a string

Posted: 31 Jan 2022 18:42
by tatewise
This is derived from the code in my Multi-Project Person Index plugin.
The UTF-8 multi-byte encoding can occur anywhere in a file, so if you don't check the entire file it may be missed,
e.g. a file that has no UTF-8 BOM and only one UTF-8 character near the end.

Code: Select all

if sText:match("^\xEF\xBB\xBF")			-- "" = UTF-8 BOM
or sText:match("[\xC2-\xF4][\x80-\xBF]+") then 	-- UTF-8 multi-byte encoding pattern
    bUnicode = true
    iBits = 8
elseif sText:match("^\xFF\xFE")			-- "ÿþ" = UTF-16 BOM
    or sText:match("^.\0") then			-- UTF-16 2-byte encoding 2nd byte 0
    bUnicode = true
    iBits = 16
else
     bUnicode = false				-- ANSI
end
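
To apply that check to a whole file, a minimal sketch (the path is only an example; binary mode avoids any newline translation) of getting the entire content into sText first:

Code: Select all

-- Sketch only: read the whole file as raw bytes, then run the check above on sText.
local f = assert(io.open("D:\\Example.ged", "rb"))
local sText = f:read("a")    -- entire file, so UTF-8 sequences near the end are not missed
f:close()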

Re: Determining the encoding of a string

Posted: 31 Jan 2022 18:48
by ColeValleyGirl
Thanks, Mike. Can I post that as a snippet attributed to you?

Re: Determining the encoding of a string

Posted: 31 Jan 2022 18:50
by tatewise
I don't mind how it is posted or attributed. Use it any way that is appropriate.

Re: Determining the encoding of a string

Posted: 31 Jan 2022 19:04
by tatewise
Forgot to mention that if there is no UTF-16 BOM, then sText:match("^.\0") is far from foolproof.
It relies on the first character being in the subset of characters whose 2nd byte is zero (see the sketch below).
I don't know of a method of checking through a file for UTF-16 encoding patterns.
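
As an illustration only (not from the plugin, and the strings are made up), a string starting with a Cyrillic character shows how that test can miss UTF-16LE when there is no BOM:

Code: Select all

-- Cyrillic "А" (U+0410) is "\x10\x04" in UTF-16LE, so its 2nd byte is not zero.
local sNoBom = "\x10\x04\x11\x04"          -- "АБ" in UTF-16LE, no BOM
print(sNoBom:match("^.\0"))                -- nil: not detected as UTF-16
local sAscii = "A\0B\0"                    -- "AB" in UTF-16LE, no BOM
print(sAscii:match("^.\0"))                -- matches: 2nd byte of "A" is zero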

Re: Determining the encoding of a string

Posted: 31 Jan 2022 19:23
by ColeValleyGirl
My thinking is that the UTF-16LE files a plugin author will need to deal with are either FH config files (where the encoding is generally known in advance, such as Fact Sets or Queries), or GEDCOM files where the encoding isn't known but FH prepends the UTF-16LE BOM.

Re: Determining the encoding of a string

Posted: 31 Jan 2022 19:38
by ColeValleyGirl

Re: Determining the encoding of a string

Posted: 31 Jan 2022 22:26
by tatewise
Yes, that should work OK.

Re: Determining the encoding of a string

Posted: 02 Feb 2022 10:03
by ColeValleyGirl
Mike,

I've just been using this snippet to detect the encoding of a UTF-16 file with BOM and for some reason it isn't working.

Thoughts? (Test file attached). I may simply be stupid, of course...

Code: Select all

function TestEncoding(sText)
  local bUnicode, iBits
  if sText:match("^\xEF\xBB\xBF") -- "" = UTF-8 BOM
  or sText:match("[\xC2-\xF4][\x80-\xBF]+") then -- UTF-8 multi-byte encoding pattern
    bUnicode = true
    iBits = 8
  elseif sText:match("^\xFF\xFE") -- "ÿþ" = UTF-16 BOM
  or sText:match("^.\0") then -- UTF-16 2-byte encoding 2nd byte 0
    bUnicode = true
    iBits = 16
  else 
    bUnicode = false -- ANSI
  end
  return bUnicode, iBits
end

f = io.open("D:\\UTF16 Copy.txt",'r')
local t = f:read('a')
f:close()
bUnicode, iBits = TestEncoding(t)


Re: Determining the encoding of a string

Posted: 02 Feb 2022 10:57
by tatewise
In what way is it not working? The function returns bUnicode=true and iBits=16 for me.
Does it work for files with a UTF-8 BOM?
Have you overloaded the string.match() function in some way that upsets the matching?

With Plugin File Encoding set to ANSI, that UTF-16 file content is shown in the debugger as below, with the UTF-16 BOM ÿþ:
ÿþ>@5<
It works just the same with Plugin File Encoding set to UTF-8, but the file content is shown in the debugger as:
��>@5<

Re: Determining the encoding of a string

Posted: 02 Feb 2022 11:13
by ColeValleyGirl
It works for UTF-8 with a BOM.

And yes, I've overlaid the string library with the utf8 library, so that string handling from Penlight works for UTF-8 encoding -- although I haven't tested every function, by inspection Penlight's string handling relies on the string library.

Looks like I'll need to develop my own snippet to work in that environment. (I've tested without the overlay and yours works fine -- I may need to add some caveats to the snippet in the KB.)
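
One possible workaround (just a sketch, assuming the overlay happens in the plugin's own startup code) is to keep a reference to the original byte-level string.match before overlaying, and use that for the encoding test:

Code: Select all

-- Capture the 8-bit-clean match before the utf8 overlay replaces it.
local byteMatch = string.match

-- ... overlay string with utf8 here, e.g. for Penlight ...

local function HasUtf16Bom(sText)
  return byteMatch(sText, "^\xFF\xFE") ~= nil    -- byte pattern, unaffected by the overlay
end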

Re: Determining the encoding of a string

Posted: 02 Feb 2022 12:14
by tatewise
The following code works with utf8 overlaid on string if that is of any use.
It assumes UTF-8 string encoding is the default.

Code: Select all

function TestEncoding(sText)
  local bUnicode, iBits
  fhSetStringEncoding("ANSI")
  if sText:match("^\xEF\xBB\xBF") -- "" = UTF-8 BOM
  or sText:match("[\xC2-\xF4][\x80-\xBF]+") then -- UTF-8 multi-byte encoding pattern
    bUnicode = true
    iBits = 8
  elseif fhCallBuiltInFunction('LeftText',sText,2) == "\xFF\xFE" -- "ÿþ" = UTF-16 BOM
  or fhCallBuiltInFunction('MidText',sText,2,1) == "\0" then -- UTF-16 2-byte encoding 2nd byte 0
    bUnicode = true
    iBits = 16
  else 
    bUnicode = false -- ANSI
  end
  fhSetStringEncoding("UTF-8")
  return bUnicode, iBits
end

Re: Determining the encoding of a string

Posted: 02 Feb 2022 15:11
by ColeValleyGirl
Thanks, Mike. I'll update the snippet to use that code instead. (I'll include a line to save the current string encoding before changing it, and restore it at the end.)
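
Something like this, assuming fhGetStringEncoding() is available to query the current setting:

Code: Select all

-- Sketch only: save the current string encoding, switch to ANSI for the
-- byte-level tests, and restore whatever was set afterwards.
local sSaved = fhGetStringEncoding()
fhSetStringEncoding("ANSI")
-- ... run the BOM and multi-byte pattern tests here ...
fhSetStringEncoding(sSaved)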

It doesn't work with files read using fhLoadTextFile; I assume FH strips the BOM. But then, you need to know the encoding before you use fhLoadTextFile...

In practice, I suspect the only significant use for this is determining whether a Gedcom file is UTF-16LE or UTF-8 (for plugin authors who want to access the Gedcom directly and not via the API, and need to handle both encodings).
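
For that case, a minimal sketch (the file name is only an example) is to peek at the first raw bytes in binary mode before deciding how to load the file:

Code: Select all

-- Read just the first three bytes of a GEDCOM file to decide between
-- UTF-16LE (FF FE BOM) and UTF-8 (EF BB BF BOM, or no BOM at all).
local f = assert(io.open("D:\\Export.ged", "rb"))
local sHead = f:read(3) or ""
f:close()

local sEncoding
if sHead:sub(1, 2) == "\xFF\xFE" then
  sEncoding = "UTF-16LE"
elseif sHead == "\xEF\xBB\xBF" then
  sEncoding = "UTF-8"
else
  sEncoding = "UTF-8 or ANSI"    -- no BOM: fall back to content checks
end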

Re: Determining the encoding of a string

Posted: 02 Feb 2022 15:45
by tatewise
Yes, OK.
I also tried fhLoadTextFile(), but with a similar lack of success.
I tried using Lua string.sub() and string.byte() without success too.
Some of my plugins allow other kinds of file supplied by users that could use any encoding, but I agree GEDCOM is the most likely case.