* Determining the encoding of a string

ColeValleyGirl
Megastar
Posts: 4853
Joined: 28 Dec 2005 22:02
Family Historian: V7
Location: Cirencester, Gloucestershire

Post by ColeValleyGirl » 31 Jan 2022 18:28

Does anyone have a code snippet to determine the encoding of a string ('ANSI', UTF-8, UTF-16LE)? The string might, of course, be read from a file...

Background: I'm still working on the KB article on handling non-ANSI characters; I have a solution for dealing with UTF-16 strings (already added to the KB as a snippet) but need to offer a solution for determining what the encoding of a file is.

No worries if nobody has one - I can knock one up, but don't like reinventing wheels - life is too short.

tatewise
Megastar
Posts: 27080
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK

Post by tatewise » 31 Jan 2022 18:42

This is derived from the code in my Multi-Project Person Index plugin.
A UTF-8 multi-byte sequence can occur anywhere in a file, so if you don't check the entire file it may be missed,
e.g. a file that has no UTF-8 BOM and only one multi-byte UTF-8 character near the end.

Code:

if sText:match("^\xEF\xBB\xBF")                    -- "ï»¿" = UTF-8 BOM
or sText:match("[\xC2-\xF4][\x80-\xBF]+") then     -- UTF-8 multi-byte encoding pattern
    bUnicode = true
    iBits = 8
elseif sText:match("^\xFF\xFE")                    -- "ÿþ" = UTF-16LE BOM
or sText:match("^.\0") then                        -- UTF-16LE 2-byte encoding with 2nd byte 0
    bUnicode = true
    iBits = 16
else
    bUnicode = false                               -- ANSI
end
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
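
For a stricter whole-file check than the byte-range pattern above, one option is to validate the text as UTF-8. This is only a sketch, assuming Lua 5.3 or later so that the standard utf8 library is available, and assuming the standard byte-oriented string library is in use. LooksLikeUTF8 is a hypothetical helper name, not part of the plugin.

```lua
-- Hypothetical helper: true only if sText is well-formed UTF-8
-- AND contains at least one non-ASCII byte.
local function LooksLikeUTF8(sText)
  if not string.find(sText, "[\x80-\xFF]") then
    return false                 -- pure 7-bit ASCII: cannot tell UTF-8 from ANSI
  end
  -- utf8.len returns nil (plus a byte position) at the first invalid sequence
  return utf8.len(sText) ~= nil
end
```

Unlike the pattern, this rejects ANSI text that happens to contain one UTF-8-like byte pair but is not well-formed UTF-8 as a whole.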

Post by ColeValleyGirl » 31 Jan 2022 18:48

Thanks, Mike. Can I post that as a snippet attributed to you?

Post by tatewise » 31 Jan 2022 18:50

I don't mind how it is posted or attributed. Use it any way that is appropriate.

Post by tatewise » 31 Jan 2022 19:04

Forgot to mention that if there is no UTF-16 BOM then sText:match("^.\0") is far from foolproof.
It relies on the first character being in the subset whose 2nd byte is zero (code points below U+0100, e.g. plain ASCII).
I don't know of a method of checking through a file for UTF-16 encoding patterns.
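
One commonly used heuristic, offered here only as a sketch and not as a reliable detector, is to look at where the NUL bytes fall within each 2-byte pair. ANSI and UTF-8 text files normally contain no NUL bytes, while UTF-16 text containing spaces, digits or line breaks has them on one side of each pair. GuessUTF16, the 4096-byte sample size and the 10:1 ratio are all assumptions chosen for illustration.

```lua
-- Hypothetical heuristic: guess UTF-16 byte order from NUL byte positions.
-- iLimit (assumed default 4096) caps how many bytes are sampled.
local function GuessUTF16(sText, iLimit)
  local iLen = math.min(#sText, iLimit or 4096)
  iLen = iLen - iLen % 2               -- whole 2-byte units only
  local iFirstNul, iSecondNul = 0, 0
  for i = 1, iLen, 2 do
    local b1, b2 = string.byte(sText, i, i + 1)
    if b1 == 0 then iFirstNul = iFirstNul + 1 end
    if b2 == 0 then iSecondNul = iSecondNul + 1 end
  end
  if iSecondNul > 10 * iFirstNul then
    return "UTF-16LE"                  -- NULs cluster in the 2nd byte of each pair
  elseif iFirstNul > 10 * iSecondNul then
    return "UTF-16BE"                  -- NULs cluster in the 1st byte of each pair
  end
  return nil                           -- no clear signal either way
end
```

Text with no characters below U+0100 at all yields no NUL bytes and returns nil, so this would complement a BOM check rather than replace it.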

Post by ColeValleyGirl » 31 Jan 2022 19:23

My thinking is that the UTF-16LE files a plugin author will need to deal with are either FH config files (where the encoding is generally known in advance, such as Fact Sets or Queries), or Gedcom files where the encoding isn't known but FH writes the UTF-16LE BOM at the start.
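
If a BOM can be relied on, as described above, only the first few bytes of the file need to be read. A minimal sketch, assuming the standard io and string libraries; EncodingFromBOM and SniffFileBOM are hypothetical names, and a file without a BOM simply returns nil.

```lua
-- Map the leading bytes of a file to an encoding name, or nil if no BOM is found.
local function EncodingFromBOM(sHead)
  if sHead:sub(1, 3) == "\xEF\xBB\xBF" then
    return "UTF-8"
  elseif sHead:sub(1, 2) == "\xFF\xFE" then
    return "UTF-16LE"
  end
  return nil
end

-- Hypothetical wrapper: read only the first 3 bytes of the file at sPath.
local function SniffFileBOM(sPath)
  local fHandle, sError = io.open(sPath, "rb")   -- binary mode: bytes are not translated
  if not fHandle then
    return nil, sError
  end
  local sHead = fHandle:read(3) or ""            -- read returns nil for an empty file
  fHandle:close()
  return EncodingFromBOM(sHead)
end
```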

Post by ColeValleyGirl » 31 Jan 2022 19:38


Post by tatewise » 31 Jan 2022 22:26

Yes, that should work OK.

Post by ColeValleyGirl » 02 Feb 2022 10:03

Mike,

I've just been using this snippet to detect the encoding of a UTF-16 file with BOM and for some reason it isn't working.

Thoughts? (Test file attached). I may simply be stupid, of course...

Code:

function TestEncoding(sText)
  local bUnicode, iBits
  if sText:match("^\xEF\xBB\xBF") -- "ï»¿" = UTF-8 BOM
  or sText:match("[\xC2-\xF4][\x80-\xBF]+") then -- UTF-8 multi-byte encoding pattern
    bUnicode = true
    iBits = 8
  elseif sText:match("^\xFF\xFE") -- "ÿþ" = UTF-16LE BOM
  or sText:match("^.\0") then -- UTF-16LE 2-byte encoding with 2nd byte 0
    bUnicode = true
    iBits = 16
  else
    bUnicode = false -- ANSI
  end
  return bUnicode, iBits
end

local f = assert(io.open("D:\\UTF16 Copy.txt", "rb")) -- binary mode so the bytes are read unmodified
local t = f:read("a")
f:close()
local bUnicode, iBits = TestEncoding(t)

Attachment: UTF16 Copy.txt (852 bytes)

Post by tatewise » 02 Feb 2022 10:57

In what way is it not working? The function returns bUnicode=true and iBits=16 for me.
Does it work for files with a UTF-8 BOM?
Have you overloaded the string.match() function in some way to upset matching?

With Plugin File Encoding set to ANSI that UTF16 file content is shown in debugger as below with the UTF-16 BOM ÿþ:
ÿþ>@5<
It works just the same with Plugin File Encoding set to UTF-8 but the file content is shown in debugger as:
��>@5<

Post by ColeValleyGirl » 02 Feb 2022 11:13

It works for UTF-8 with a BOM.

And yes, I've overlaid the string library with the .utf8 library, so that string handling from Penlight works for UTF-8 encoding -- although I haven't tested every function, by inspection Penlight string handling relies on the string library.

Looks like I'll need to develop my own snippet to work in that environment. (I've tested without the overlay and yours works fine -- I may need to add some caveats to the snippet in the KB.)
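
One possible workaround, sketched here under the assumption that the overlay works by replacing entries in the global string table: take a copy of the standard byte-oriented functions before the overlay is applied, and use that copy for encoding tests. The names rawstring, HasUTF8BOM and HasUTF16LEBOM are hypothetical.

```lua
-- Snapshot the standard byte-oriented functions BEFORE any overlay is applied.
local rawstring = {}
for sName, fFunc in pairs(string) do
  rawstring[sName] = fFunc
end

-- (The overlay would be applied here; how that is done depends on the plugin.)

-- Byte-level BOM tests that bypass whatever now lives in the string table.
local function HasUTF8BOM(sText)
  return rawstring.sub(sText, 1, 3) == "\xEF\xBB\xBF"
end

local function HasUTF16LEBOM(sText)
  return rawstring.sub(sText, 1, 2) == "\xFF\xFE"
end
```

Method-style calls such as sText:match() look functions up in the string table at call time, which is why they pick up the overlaid versions, whereas the snapshot keeps pointing at the originals.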

Post by tatewise » 02 Feb 2022 12:14

The following code works with utf8 overlaid on string if that is of any use.
It assumes UTF-8 string encoding is the default.

Code:

function TestEncoding(sText)
  local bUnicode, iBits
  fhSetStringEncoding("ANSI")
  if sText:match("^\xEF\xBB\xBF") -- "ï»¿" = UTF-8 BOM
  or sText:match("[\xC2-\xF4][\x80-\xBF]+") then -- UTF-8 multi-byte encoding pattern
    bUnicode = true
    iBits = 8
  elseif fhCallBuiltInFunction('LeftText',sText,2) == "\xFF\xFE" -- "ÿþ" = UTF-16LE BOM
  or fhCallBuiltInFunction('MidText',sText,2,1) == "\0" then -- UTF-16LE 2-byte encoding with 2nd byte 0
    bUnicode = true
    iBits = 16
  else
    bUnicode = false -- ANSI
  end
  fhSetStringEncoding("UTF-8")
  return bUnicode, iBits
end

Post by ColeValleyGirl » 02 Feb 2022 15:11

Thanks, Mike. I'll update the snippet to use that code instead. (I'll include a line to save the current string encoding before changing it, and restore it at the end.)

It doesn't work with files read using fhLoadTextFile; I assume FH strips the BOM. But then, you need to know the encoding before you use fhLoadTextFile...

In practice, I suspect the only significant use for this is determining whether a Gedcom file is UTF-16LE or UTF-8 (for plugin authors who want to access the Gedcom directly and not via the API, and need to handle both encodings).
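
The save-and-restore idea mentioned above could be wrapped up as follows. This is a sketch only: it assumes fhGetStringEncoding() is available alongside fhSetStringEncoding() in the plugin API, it only runs inside a Family Historian plugin, and WithStringEncoding is a hypothetical helper name.

```lua
-- Hypothetical helper: run fFunc(...) with the string encoding temporarily
-- set to sTemp, then restore whatever encoding was in force before,
-- even if fFunc raises an error.
local function WithStringEncoding(sTemp, fFunc, ...)
  local sSaved = fhGetStringEncoding()     -- assumed FH API call
  fhSetStringEncoding(sTemp)
  local tResult = table.pack(pcall(fFunc, ...))
  fhSetStringEncoding(sSaved)              -- always restore
  if not tResult[1] then
    error(tResult[2], 0)                   -- re-raise the original error
  end
  return table.unpack(tResult, 2, tResult.n)
end
```

With this in place the detection function would not need to set the encoding itself, e.g. WithStringEncoding("ANSI", TestEncoding, sText), and would no longer assume UTF-8 is the default.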

Post by tatewise » 02 Feb 2022 15:45

Yes, OK.
I also tried fhLoadTextFile() but with a similar lack of success.
I also tried Lua string.sub() and string.byte(), without success.
Some of my plugins accept other kinds of file supplied by users, which could use any encoding, but I agree this is most likely to be needed for GEDCOM.