* Determining the encoding of a string
- ColeValleyGirl
- Megastar
- Posts: 4853
- Joined: 28 Dec 2005 22:02
- Family Historian: V7
- Location: Cirencester, Gloucestershire
- Contact:
Determining the encoding of a string
Does anyone have a code snippet to determine the encoding of a string ('ANSI', UTF-8, or UTF-16LE)? The string might, of course, be read from a file.
Background: I'm still working on the KB article on handling non-ANSI characters; I have a solution for dealing with UTF-16 strings (already added to the KB as a snippet) but need to offer a solution for determining what the encoding of a file is.
No worries if nobody has one - I can knock one up, but don't like reinventing wheels - life is too short.
Helen Wright
ColeValleyGirl's family history
- tatewise
- Megastar
- Posts: 27080
- Joined: 25 May 2010 11:00
- Family Historian: V7
- Location: Torbay, Devon, UK
- Contact:
Re: Determining the encoding of a string
This is derived from the code in my Multi-Project Person Index plugin.
UTF-8 multi-byte sequences can occur anywhere in a file, so if you don't check the entire file they may be missed.
For example, a file might have no UTF-8 BOM and only one UTF-8 multi-byte character near the end.
Code:
if sText:match("^\xEF\xBB\xBF")                      -- "ï»¿" = UTF-8 BOM
or sText:match("[\xC2-\xF4][\x80-\xBF]+") then       -- UTF-8 multi-byte encoding pattern
    bUnicode = true
    iBits = 8
elseif sText:match("^\xFF\xFE")                      -- "ÿþ" = UTF-16 BOM
or sText:match("^.\0") then                          -- UTF-16 2-byte encoding 2nd byte 0
    bUnicode = true
    iBits = 16
else
    bUnicode = false                                 -- ANSI
end
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
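The pattern in the post above reports UTF-8 as soon as it finds one lead byte followed by continuation bytes. A stricter whole-string check might look like the sketch below. It assumes plain Lua with the standard byte-oriented string library, and the name IsValidUtf8 is hypothetical, not part of the plugin or the FH API. It walks every byte, so a single multi-byte character near the end of a file is still seen, and it returns false for text containing bytes that cannot be UTF-8.

```lua
-- Hypothetical helper: returns true when every byte sequence in sText is
-- structurally valid UTF-8, plus a second flag that is true when at least
-- one multi-byte sequence was found.
-- Simplification: overlong forms and surrogate code points are not rejected.
function IsValidUtf8(sText)
    local i, n = 1, #sText
    local bMulti = false
    while i <= n do
        local b = sText:byte(i)
        local iExtra                              -- continuation bytes expected
        if b < 0x80 then iExtra = 0               -- ASCII
        elseif b >= 0xC2 and b <= 0xDF then iExtra = 1
        elseif b >= 0xE0 and b <= 0xEF then iExtra = 2
        elseif b >= 0xF0 and b <= 0xF4 then iExtra = 3
        else return false, bMulti end             -- not a legal lead byte
        for j = i + 1, i + iExtra do
            local c = sText:byte(j)               -- nil if the string ends early
            if not c or c < 0x80 or c > 0xBF then return false, bMulti end
        end
        if iExtra > 0 then bMulti = true end
        i = i + iExtra + 1
    end
    return true, bMulti
end
```

A result of true, false means pure ASCII, which is the same in ANSI and UTF-8. A result of false suggests ANSI or UTF-16 rather than UTF-8.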
- ColeValleyGirl
Re: Determining the encoding of a string
Thanks, Mike. Can I post that as a snippet attributed to you?
Helen Wright
ColeValleyGirl's family history
- tatewise
Re: Determining the encoding of a string
I don't mind how it is posted or attributed. Use it any way that is appropriate.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- tatewise
Re: Determining the encoding of a string
Forgot to mention that if there is no UTF-16 BOM then sText:match("^.\0") is far from foolproof.
It relies on the first character being in the subset that has 2nd byte zero.
I don't know of a method of checking through a file for UTF-16 encoding patterns.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
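One possible heuristic for UTF-16 without a BOM, offered as a rough sketch rather than a proven method: in UTF-16 text that is mostly Latin characters, one byte of each 2-byte pair is usually zero, so counting zero bytes at odd and even positions hints at the byte order. The function name GuessUtf16ByNulls, the 4096-byte sample size, and the 30% and 5% thresholds are all arbitrary assumptions. Text that is mostly non-Latin has few zero bytes and will not be recognised this way.

```lua
-- Hypothetical heuristic: guess UTF-16 byte order from where zero bytes fall.
-- Returns "UTF-16LE", "UTF-16BE", or nil when there is no clear signal.
function GuessUtf16ByNulls(sText, iLimit)
    local n = math.min(#sText, iLimit or 4096)    -- sample size is an assumption
    n = n - n % 2                                 -- only look at whole byte pairs
    if n < 2 then return nil end
    local iOddNulls, iEvenNulls = 0, 0
    for i = 1, n, 2 do
        if sText:byte(i) == 0 then iOddNulls = iOddNulls + 1 end
        if sText:byte(i + 1) == 0 then iEvenNulls = iEvenNulls + 1 end
    end
    local iPairs = n / 2
    -- Thresholds are illustrative guesses, not tuned values.
    if iEvenNulls >= iPairs * 0.3 and iOddNulls <= iPairs * 0.05 then
        return "UTF-16LE"                         -- zero is the 2nd byte of each pair
    elseif iOddNulls >= iPairs * 0.3 and iEvenNulls <= iPairs * 0.05 then
        return "UTF-16BE"                         -- zero is the 1st byte of each pair
    end
    return nil
end
```

Unlike the "^.\0" test, this looks at many characters rather than only the first, but it is still only a guess and a BOM check should take priority when a BOM is present.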
- ColeValleyGirl
Re: Determining the encoding of a string
My thinking is that the UTF-16LE files a plugin author will need to deal with are either FH config files (where the encoding is generally known in advance, such as Fact Sets or Queries), or Gedcom files where the encoding isn't known but FH prepends the UTF-16LE BOM.
Helen Wright
ColeValleyGirl's family history
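If the files of interest always carry a BOM, as suggested above for Gedcom files, checking the leading bytes alone may be enough. A minimal sketch, assuming the standard byte-oriented Lua string library; the name DetectBom is hypothetical and the UTF-16BE branch is included only for completeness.

```lua
-- Hypothetical helper: identify an encoding from a leading BOM only.
-- Returns the encoding name and the BOM length in bytes, or nil and 0.
function DetectBom(sText)
    if sText:sub(1, 3) == "\239\187\191" then     -- EF BB BF
        return "UTF-8", 3
    elseif sText:sub(1, 2) == "\255\254" then     -- FF FE
        return "UTF-16LE", 2
    elseif sText:sub(1, 2) == "\254\255" then     -- FE FF
        return "UTF-16BE", 2
    end
    return nil, 0                                 -- no BOM: encoding unknown
end
```

The returned length can be used to skip the BOM, for example with sText:sub(iBomLength + 1), before the text is processed further.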
- ColeValleyGirl
- tatewise
Re: Determining the encoding of a string
Yes, that should work OK.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- ColeValleyGirl
Re: Determining the encoding of a string
Mike,
I've just been using this snippet to detect the encoding of a UTF-16 file with BOM and for some reason it isn't working.
Thoughts? (Test file attached). I may simply be stupid, of course...
Code:
function TestEncoding(sText)
    local bUnicode, iBits
    if sText:match("^\xEF\xBB\xBF")                      -- "ï»¿" = UTF-8 BOM
    or sText:match("[\xC2-\xF4][\x80-\xBF]+") then       -- UTF-8 multi-byte encoding pattern
        bUnicode = true
        iBits = 8
    elseif sText:match("^\xFF\xFE")                      -- "ÿþ" = UTF-16 BOM
    or sText:match("^.\0") then                          -- UTF-16 2-byte encoding 2nd byte 0
        bUnicode = true
        iBits = 16
    else
        bUnicode = false                                 -- ANSI
    end
    return bUnicode, iBits
end
f = io.open("D:\\UTF16 Copy.txt",'rb')                   -- binary mode, so Windows does not translate any bytes
local t = f:read('a')
f:close()
bUnicode, iBits = TestEncoding(t)
- Attachments
- UTF16 Copy.txt (852 Bytes)
Helen Wright
ColeValleyGirl's family history
- tatewise
Re: Determining the encoding of a string
In what way is it not working? The function returns bUnicode=true and iBits=16 for me.
Does it work for files with a UTF-8 BOM?
Have you overloaded the string.match() function in some way to upset matching?
With Plugin File Encoding set to ANSI that UTF16 file content is shown in debugger as below with the UTF-16 BOM ÿþ:
ÿþ>@5<
It works just the same with Plugin File Encoding set to UTF-8 but the file content is shown in debugger as:
��>@5<
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
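Because the debugger renders the same bytes differently depending on the Plugin File Encoding setting, comparing raw byte values can be less ambiguous than comparing displayed characters. A small sketch, assuming the standard byte-oriented string library is in effect; the name HexDump is hypothetical.

```lua
-- Hypothetical helper: show the first iCount bytes of sText as hex pairs.
function HexDump(sText, iCount)
    local tHex = {}
    for i = 1, math.min(#sText, iCount or 16) do
        tHex[#tHex + 1] = string.format("%02X", sText:byte(i))
    end
    return table.concat(tHex, " ")
end
```

A UTF-16LE file with a BOM would start with FF FE, and a UTF-8 file with a BOM would start with EF BB BF, whatever the debugger displays.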
- ColeValleyGirl
Re: Determining the encoding of a string
It works for UTF-8 with a BOM.
And yes, I've overlaid the string library with the .utf8 library, so that string handling from Penlight works for UTF-8 encoding. I haven't tested every function, but by inspection Penlight string handling relies on the string library.
Looks like I'll need to develop my own snippet to work in that environment. (I've tested without the overlay and yours works fine; I may need to add some caveats to the snippet in the KB.)
Helen Wright
ColeValleyGirl's family history
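One way to keep byte-level detection independent of a string library overlay, sketched here under assumptions, is to capture references to the standard functions before the overlay is applied and call those directly. The helper name HasUtf16LeBom is hypothetical, and the overlay in the demo is a stand-in for illustration, not the .utf8 library itself.

```lua
-- Capture the standard byte-oriented function BEFORE any overlay is applied.
local rawByte = string.byte

-- Hypothetical helper: true when sText starts with the UTF-16LE BOM (FF FE).
function HasUtf16LeBom(sText)
    local b1, b2 = rawByte(sText, 1, 2)           -- direct call, not sText:byte()
    return b1 == 0xFF and b2 == 0xFE
end

-- Demo with a stand-in overlay (an assumption for illustration only).
local fSaved = string.byte
string.byte = function() return nil end           -- simulate a replaced string.byte
local bStillWorks = HasUtf16LeBom("\255\254a\0")  -- uses rawByte, so it is unaffected
string.byte = fSaved                              -- undo the stand-in
```

The important detail is calling rawByte(sText, ...) rather than sText:byte(...), because the method form looks the function up in the string table at call time and so picks up whatever the overlay installed.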
- tatewise
Re: Determining the encoding of a string
The following code works with utf8 overlaid on string if that is of any use.
It assumes UTF-8 string encoding is the default.
Code:
function TestEncoding(sText)
    local bUnicode, iBits
    fhSetStringEncoding("ANSI")
    if sText:match("^\xEF\xBB\xBF")                                     -- "ï»¿" = UTF-8 BOM
    or sText:match("[\xC2-\xF4][\x80-\xBF]+") then                      -- UTF-8 multi-byte encoding pattern
        bUnicode = true
        iBits = 8
    elseif fhCallBuiltInFunction('LeftText',sText,2) == "\xFF\xFE"      -- "ÿþ" = UTF-16 BOM
    or fhCallBuiltInFunction('MidText',sText,2,1) == "\0" then          -- UTF-16 2-byte encoding 2nd byte 0
        bUnicode = true
        iBits = 16
    else
        bUnicode = false                                                -- ANSI
    end
    fhSetStringEncoding("UTF-8")
    return bUnicode, iBits
end
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry
- ColeValleyGirl
Re: Determining the encoding of a string
Thanks, Mike. I'll update the snippet to use that code instead. (I'll include a line to save the current string encoding before changing it, and restore it at the end.)
It doesn't work with files read using fhLoadTextFile; I assume FH strips the BOM. But then, you need to know the encoding before you use fhLoadTextFile...
In practice, I suspect the only significant use for this is determining whether a Gedcom file is UTF-16LE or UTF-8 (for plugin authors who want to access the Gedcom directly and not via the API, and need to handle both encodings).
Helen Wright
ColeValleyGirl's family history
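The save-and-restore idea mentioned above could be wrapped in a small helper, sketched here with the getter and setter passed in as parameters so that it runs outside FH. The name WithEncoding is hypothetical, and using fhGetStringEncoding as the getter is an assumption about the FH API that should be checked against the plugin documentation. The pcall is there so the original setting is restored even if the wrapped function raises an error.

```lua
-- Hypothetical helper: run fBody(sText) with a temporary string encoding,
-- then restore whatever encoding was in effect before the call.
-- fGet and fSet are supplied by the caller (assumed to be the FH API
-- functions fhGetStringEncoding and fhSetStringEncoding inside a plugin).
function WithEncoding(sTemp, fGet, fSet, fBody, sText)
    local sSaved = fGet()                         -- remember the current setting
    fSet(sTemp)
    local bOk, r1, r2 = pcall(fBody, sText)       -- up to two results are kept
    fSet(sSaved)                                  -- restore, even after an error
    if not bOk then error(r1, 0) end              -- re-raise the original error
    return r1, r2
end
```

Inside a plugin the call might look like WithEncoding("ANSI", fhGetStringEncoding, fhSetStringEncoding, TestEncoding, sText), in which case TestEncoding itself would no longer need to change the encoding.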
- tatewise
Re: Determining the encoding of a string
Yes, OK.
I also tried fhLoadTextFile() but with a similar lack of success.
Tried using Lua string.sub() and string.byte() without success.
Some of my plugins accept other kinds of file supplied by users that could use any encoding, but I agree GEDCOM is the most likely case.
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry