* Question for Mike Tate about Encoder code

For users to report plugin bugs and request plugin enhancements; and for authors to test new plugins or new versions of plugins, and to discuss plugin development (in the Programming Technicalities sub-forum). If you want advice on choosing or using a plugin, please ask in General Usage or an appropriate sub-forum.
ColeValleyGirl
Megastar
Posts: 5464
Joined: 28 Dec 2005 22:02
Family Historian: V7
Location: Cirencester, Gloucestershire
Contact:

Question for Mike Tate about Encoder code

Post by ColeValleyGirl »

I'm trying to use

Code: Select all

function StrUTF8_UTF16(strText)
but am having problems.

Is the parameter strText supposed to be a single UTF8 'character'? I was assuming not, as StrUtf8toUtf16 (which is called by StrUTF8_UTF16) takes a single character.

However, this code doesn't return what I expect.

Code: Select all

    
    function UTF8MultiLinetoUTF16 (strString)
        --strString is a UTF8 string with multiple lines within it

        local tblConvertedLines = stringx.splitlines(strString) 

        --splitlines is a Penlight function to split a string into a list of lines. "\r", "\n", and "\r\n" are considered line ends but not included in the lines -- I have verified that the output of this stage is as expected
 
        for i,v in ipairs(tblConvertedLines) do
            tblConvertedLines[i]=StrUTF8_UTF16(v)
        end
        
        return tblConvertedLines
         
    end
    
Input (UTF8)

[.index]
Ver1=4
Ver2=0
Count=1
Item1=_ATTR-TASK-IA
[FCT-_ATTR-TASK-IA]
Name=Task
Template=<br>{label}: <{value}> {=CombineText( "[Мoкв: ", GetLabelledText( %FACT.NOTE2%, "Мoкв: " ), "]", "" )} {=CombineText( "[Priority: ", GetLabelledText(%FACT.NOTE2%, "Priority: " ), "]", "" )}
Event Tab={label}: <{value}> {=CombineText( "[Мoкв: ", GetLabelledText( %FACT.NOTE2%, "Мoкв: " ), "]", "" )} {=CombineText( "[Priority: ", GetLabelledText(%FACT.NOTE2%, "Priority: " ), "]", "" )}
Rec Win={label}: <{value}> {=CombineText( "[Мoкв: ", GetLabelledText( %FACT.NOTE2%, "Мoкв: " ), "]", "" )} {=CombineText( "[Priority: ", GetLabelledText(%FACT.NOTE2%, "Priority: " ), "]", "" )}
Label=Task
Abbr=
Timeframe=POST-DEATH
Field Date=0
Field Age=0
Field Place=0
Field Address=0
Field Note=1
Fast-Add Menu=Y
Hidden=N
[Text-FCT-_ATTR-TASK-IA-Auto Note]
Count=7
Line1=n;Мoкв: ;
Line2=n;Priority: ;
Line3=n;--------------------;
Line4=n;Objective:;
Line5=n;--------------------;
Line6=n;Notes:;
Line7=0;
[FCT-_ATTR-TASK-IA-ROLE]
Roles=0

Output:

Ver1=4਍嘀攀爀㈀㴀 ഀ
Count=1਍䤀琀攀洀㄀㴀开䄀吀吀刀ⴀ吀䄀匀䬀ⴀ䤀䄀ഀ
[FCT-_ATTR-TASK-IA]਍一愀洀攀㴀吀愀猀欀ഀ
Template=<br>{label}: <{value}> {=CombineText( "[Мoкв: ", GetLabelledText( %FACT.NOTE2%, "Мoкв: " ), "]", "" )} {=CombineText( "[Priority: ", GetLabelledText(%FACT.NOTE2%, "Priority: " ), "]", "" )}਍䔀瘀攀渀琀 吀愀戀㴀笀氀愀戀攀氀紀㨀 㰀笀瘀愀氀甀攀紀㸀 笀㴀䌀漀洀戀椀渀攀吀攀砀琀⠀ ∀嬀ᰀ漄㨀㈄㨄 ∀Ⰰ 䜀攀琀䰀愀戀攀氀氀攀搀吀攀砀琀⠀ ─䘀䄀䌀吀⸀一伀吀䔀㈀─Ⰰ ∀ᰀ漄㨀㈄㨄 ∀ ⤀Ⰰ ∀崀∀Ⰰ ∀∀ ⤀紀 笀㴀䌀漀洀戀椀渀攀吀攀砀琀⠀ ∀嬀倀爀椀漀爀椀琀礀㨀 ∀Ⰰ 䜀攀琀䰀愀戀攀氀氀攀搀吀攀砀琀⠀─䘀䄀䌀吀⸀一伀吀䔀㈀─Ⰰ ∀倀爀椀漀爀椀琀礀㨀 ∀ ⤀Ⰰ ∀崀∀Ⰰ ∀∀ ⤀紀ഀ
Rec Win={label}: <{value}> {=CombineText( "[Мoкв: ", GetLabelledText( %FACT.NOTE2%, "Мoкв: " ), "]", "" )} {=CombineText( "[Priority: ", GetLabelledText(%FACT.NOTE2%, "Priority: " ), "]", "" )}਍䰀愀戀攀氀㴀吀愀猀欀ഀ
Abbr=਍吀椀洀攀昀爀愀洀攀㴀倀伀匀吀ⴀ䐀䔀䄀吀䠀ഀ
Field Date=0਍䘀椀攀氀搀 䄀最攀㴀 ഀ
Field Place=0਍䘀椀攀氀搀 䄀搀搀爀攀猀猀㴀 ഀ
Field Note=1਍䘀愀猀琀ⴀ䄀搀搀 䴀攀渀甀㴀夀ഀ
Hidden=N਍嬀吀攀砀琀ⴀ䘀䌀吀ⴀ开䄀吀吀刀ⴀ吀䄀匀䬀ⴀ䤀䄀ⴀ䄀甀琀漀 一漀琀攀崀ഀ
Count=7਍䰀椀渀攀㄀㴀渀㬀ᰀ漄㨀㈄㨄 㬀ഀ
Line2=n;Priority: ;਍䰀椀渀攀㌀㴀渀㬀ⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀ㬀ഀ
Line4=n;Objective:;਍䰀椀渀攀㔀㴀渀㬀ⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀ㬀ഀ
Line6=n;Notes:;਍䰀椀渀攀㜀㴀 㬀ഀ
[FCT-_ATTR-TASK-IA-ROLE]਍刀漀氀攀猀㴀 
tatewise
Megastar
Posts: 28333
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK
Contact:

Re: Question for Mike Tate about Encoder code

Post by tatewise »

StrUTF8_UTF16 does expect multi-character UTF8 strings.

I am sure the encoding works correctly.
If after using StrUTF8_UTF16 you apply StrUTF16_UTF8 to the output you get back the original UTF8 string.

The snag with UTF16 text strings is that originally ASCII characters are each followed by a 0 byte that in many formats terminates strings.
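To make the 0-byte point concrete, here is a minimal sketch that dumps the bytes of a short ASCII string after conversion to little-endian UTF16. `AsciiToUtf16le` and `HexDump` are hypothetical helpers written for this illustration; the first is an ASCII-only stand-in, not the library's StrUTF8_UTF16.

```lua
-- ASCII-only stand-in for a UTF8 to UTF16 (little-endian) encoder.
-- Hypothetical helper: the real StrUTF8_UTF16 also handles non-ASCII text.
function AsciiToUtf16le(text)
    return (text:gsub(".", function(c) return c .. string.char(0) end))
end

-- Render each byte of a string as two hex digits followed by a space.
function HexDump(bytes)
    return (bytes:gsub(".", function(c) return string.format("%02X ", c:byte()) end))
end

print(HexDump(AsciiToUtf16le("Ver1")))   -- 56 00 65 00 72 00 31 00
```

Every second byte is 00, so anything that reads the result as a 0-terminated string stops after the first character.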

How are you viewing/using the UTF16 output?
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

Post by ColeValleyGirl »

If viewed within the Lua editor embedded in FH, it looks like this:

tblConvertedLines => (table #31)
[1] => "["
[2] => "V"
[3] => "V"
[4] => "C"
[5] => "I"
[6] => "["
[7] => "N"
[8] => "T"
[9] => "E"
[10] => "R"
[11] => "L"
[12] => "A"
[13] => "T"
[14] => "F"
[15] => "F"
[16] => "F"
[17] => "F"
[18] => "F"
[19] => "F"
[20] => "H"
[21] => "["
[22] => "C"
[23] => "L"
[24] => "L"
[25] => "L"
[26] => "L"
[27] => "L"
[28] => "L"
[29] => "L"
[30] => "["
[31] => "R"

Concatenated and saved to a file with a BOM ÿþ at the start and opened with Notepad (attached), it looks like this:

Ver1=4਍嘀攀爀㈀㴀 ഀ
Count=1਍䤀琀攀洀㄀㴀开䄀吀吀刀ⴀ吀䄀匀䬀ⴀ䤀䄀ഀ
[FCT-_ATTR-TASK-IA]਍一愀洀攀㴀吀愀猀欀ഀ
Template=<br>{label}: <{value}> {=CombineText( "[Мoкв: ", GetLabelledText( %FACT.NOTE2%, "Мoкв: " ), "]", "" )} {=CombineText( "[Priority: ", GetLabelledText(%FACT.NOTE2%, "Priority: " ), "]", "" )}਍䔀瘀攀渀琀 吀愀戀㴀笀氀愀戀攀氀紀㨀 㰀笀瘀愀氀甀攀紀㸀 笀㴀䌀漀洀戀椀渀攀吀攀砀琀⠀ ∀嬀ᰀ漄㨀㈄㨄 ∀Ⰰ 䜀攀琀䰀愀戀攀氀氀攀搀吀攀砀琀⠀ ─䘀䄀䌀吀⸀一伀吀䔀㈀─Ⰰ ∀ᰀ漄㨀㈄㨄 ∀ ⤀Ⰰ ∀崀∀Ⰰ ∀∀ ⤀紀 笀㴀䌀漀洀戀椀渀攀吀攀砀琀⠀ ∀嬀倀爀椀漀爀椀琀礀㨀 ∀Ⰰ 䜀攀琀䰀愀戀攀氀氀攀搀吀攀砀琀⠀─䘀䄀䌀吀⸀一伀吀䔀㈀─Ⰰ ∀倀爀椀漀爀椀琀礀㨀 ∀ ⤀Ⰰ ∀崀∀Ⰰ ∀∀ ⤀紀ഀ
Rec Win={label}: <{value}> {=CombineText( "[Мoкв: ", GetLabelledText( %FACT.NOTE2%, "Мoкв: " ), "]", "" )} {=CombineText( "[Priority: ", GetLabelledText(%FACT.NOTE2%, "Priority: " ), "]", "" )}਍䰀愀戀攀氀㴀吀愀猀欀ഀ
Abbr=਍吀椀洀攀昀爀愀洀攀㴀倀伀匀吀ⴀ䐀䔀䄀吀䠀ഀ
Field Date=0਍䘀椀攀氀搀 䄀最攀㴀 ഀ
Field Place=0਍䘀椀攀氀搀 䄀搀搀爀攀猀猀㴀 ഀ
Field Note=1਍䘀愀猀琀ⴀ䄀搀搀 䴀攀渀甀㴀夀ഀ
Hidden=N਍嬀吀攀砀琀ⴀ䘀䌀吀ⴀ开䄀吀吀刀ⴀ吀䄀匀䬀ⴀ䤀䄀ⴀ䄀甀琀漀 一漀琀攀崀ഀ
Count=7਍䰀椀渀攀㄀㴀渀㬀ᰀ漄㨀㈄㨄 㬀ഀ
Line2=n;Priority: ;਍䰀椀渀攀㌀㴀渀㬀ⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀ㬀ഀ
Line4=n;Objective:;਍䰀椀渀攀㔀㴀渀㬀ⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀⴀ㬀ഀ
Line6=n;Notes:;਍䰀椀渀攀㜀㴀 㬀ഀ
[FCT-_ATTR-TASK-IA-ROLE]਍刀漀氀攀猀㴀 

but when viewed with a hex editor it looks OK.

It's supposed to be a Fact set file but FH doesn't recognise the facts when the fact set is installed.
Attachment: Research Planner.fhf (2 KiB)
If I save the UTF8 input string as a file and then use Powershell to change the encoding from UTF8 to UTF16, FH opens it and installs the Fact without a problem, so the input string is OK.

I'm stumped...

Post by ColeValleyGirl »

A bit of context -- I'm updating the Research Planner plugin to allow users to customise the fields included when 'Tasks' are created, either within the Planner or within FH by creating a 'Task' fact. So I can't use a fixed-content file but have to build it up based on the options the users choose.

Post by tatewise »

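Remember that although a Lua string can hold 0 bytes, many viewers (such as the variable display in the Lua editor) treat it as ASCII chars terminated with a 0 byte.
So, the converted lines will look like that because the UTF16 0 byte after the first byte terminates the displayed string.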
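As a small sketch of what such a display is probably doing: the Lua value itself keeps every byte (the # operator counts embedded 0 bytes), while a view that stops at the first 0 byte shows only the first character. `CStringView` is a hypothetical helper that mimics a 0-terminated view; it is not part of any library.

```lua
-- "V", "e", "r" in UTF16 little-endian: each ASCII byte followed by a 0 byte.
local utf16 = string.char(0x56, 0, 0x65, 0, 0x72, 0)

-- Mimic a viewer that treats the value as a 0-terminated C string.
function CStringView(bytes)
    for i = 1, #bytes do
        if bytes:byte(i) == 0 then
            return bytes:sub(1, i - 1)
        end
    end
    return bytes
end

print(#utf16)               -- 6: the Lua value keeps all six bytes
print(CStringView(utf16))   -- V: all that a 0-terminated view shows
```

This matches the table listing above, where every converted line appears as a single character even though the full data is still there.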

How are you saving to file?
How are you adding the newline chars at end of each line?
Are you coding those newline chars in UTF16 format?

For example in my Export Gedcom File Plugin it writes records using:
local putExport = general.OpenFile(<filespec>,"wb")
local strRecord = doEncode(table.concat(arrRecord,"\r\n").."\r\n")
putExport:write(strBOM..strRecord)

where doEncode invokes one of:
local EncodeANSI = encoder.StrUTF8_ANSI
local EncodeISO = encoder.StrUTF8_ISO
local EncodeUTF8 = tostring
local EncodeUTF16= encoder.StrUTF8_UTF16

The last of which converts UTF8 to UTF16 as you require.

I have concocted a script like yours that seems to work OK.

Code: Select all

-- Split a string using "," or chosen separator --
function split(strTxt,strSep)
	local tblFields = {}
	local strPattern = string.format("([^%s]+)", strSep or ",")
	strTxt = tostring(strTxt or "")
	strTxt:gsub(strPattern, function(strField) tblFields[#tblFields+1] = strField end)
	return tblFields
end -- function split

-- Open File and return Handle --
function OpenFile(strFileName,strMode)
	local fileHandle, strError = io.open(strFileName,strMode)
	if fileHandle == nil then
		error("\n Unable to open file in \""..strMode.."\" mode. \n "..strFileName.." \n "..strError.." \n")
	end
	return fileHandle
end -- function OpenFile

strText = [[
Name=Task
Template=<br>{label}: <{value}> {=CombineText( "[Мoкв: ", GetLabelledText( %FACT.NOTE2%, "Мoкв: " ), "]", "" )} 
Event Tab={label}: <{value}> {=CombineText( "[Мoкв: ", GetLabelledText( %FACT.NOTE2%, "Мoкв: " ), "]", "" )} 
Rec Win={label}: <{value}> {=CombineText( "[Мoкв: ", GetLabelledText( %FACT.NOTE2%, "Мoкв: " ), "]", "" )}
Label=Task
]]

    function UTF8MultiLinetoUTF16 (strString)
        --strString is a UTF8 string with multiple lines within it
        local tblConvertedLines = split(strString,"\n")
        for i,v in ipairs(tblConvertedLines) do
            tblConvertedLines[i]= StrUTF8_UTF16(v .. "\r\n")
        end
        return tblConvertedLines
    end

local putExport = OpenFile("C:\\Users\\Mike\\OneDrive\\Desktop\\TestOutput.txt","wb")
local arrRecord = UTF8MultiLinetoUTF16(strText)
local strRecord = table.concat(arrRecord,"")
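putExport:write(string.char(0xFF,0xFE)..strRecord)	-- UTF16 BOM "ÿþ" written as raw bytes, so it does not depend on the encoding of the script file itself
putExport:close()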
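One reading of the garbled Notepad output earlier in the thread, offered as an inference rather than something confirmed by either poster: it is consistent with UTF16 text being written through a file opened in text mode ("w") on Windows, where every 0x0A byte is expanded to 0x0D 0x0A. The extra byte shifts the following line by one byte until the next newline shifts it back, which would explain why alternate lines are readable. The sketch below simulates that expansion in plain Lua so it runs anywhere; all three functions are hypothetical helpers and the encoder is an ASCII-only stand-in.

```lua
-- ASCII-only stand-in for a UTF8 to UTF16 (little-endian) encoder.
function AsciiToUtf16le(text)
    return (text:gsub(".", function(c) return c .. string.char(0) end))
end

-- Simulate a Windows text-mode write: each 0x0A byte becomes 0x0D 0x0A.
function SimulateTextModeWrite(bytes)
    return (bytes:gsub("\n", "\r\n"))
end

-- List the 16-bit little-endian code units of a byte string in hex.
function CodeUnits(bytes)
    local units = {}
    for i = 1, #bytes - 1, 2 do
        local lo, hi = bytes:byte(i, i + 1)
        units[#units + 1] = string.format("%04X", hi * 256 + lo)
    end
    return table.concat(units, " ")
end

local written = SimulateTextModeWrite(AsciiToUtf16le("Ver1=4\nVer2=0\n"))
print(CodeUnits(written))
-- The first line decodes normally (0056 0065 0072 0031 003D 0034), then the
-- stream continues 0A0D 5600 6500 ..., i.e. the second line is byte-shifted.
```

U+0A0D and U+5600 are the characters "਍" and "嘀" seen after "Ver1=4" in the Notepad view. Opening the file with "wb", as in the script above, avoids the expansion.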

Post by ColeValleyGirl »

Sorted -- thanks Mike. For anyone playing along at home, the final code I've ended up with is:

Code: Select all

    
    pl = require("pl.import_into")
    stringx= require("pl.stringx")
    utils = require("pl.utils")

   ConfiguredFactFile = fhGetContextInfo("CI_APP_DATA_FOLDER").."\\Fact Types\\Custom\\"..gcstrPluginName..".fhf"

    function UTF8MultiLinetoUTF16 (strString)
        --input is a multiline string which can be plain ASCII (which is UTF8 compatible) or UTF8
        --returns an equivalent UTF16 string
        local tblConvertedLines = stringx.splitlines(strString)
        for i,v in ipairs(tblConvertedLines) do
            tblConvertedLines[i]=StrUTF8_UTF16(v.."\r\n")
        end
        return table.concat(tblConvertedLines, "")
    end

    function WriteFactDefinitionFile(strDefUTF8)
        local bomUtf16= string.char(0xFF,0xFE)		-- "ÿþ"
        local f = assert(io.open(ConfiguredFactFile, "wb")) -- open in "binary" mode -- throws an error if the open fails
        if f ~= nil then --open succeeded
            f:write(bomUtf16..UTF8MultiLinetoUTF16(strDefUTF8))
            f:close()
        end
    end
    

Post by tatewise »

Your UTF8MultiLinetoUTF16 encodes and returns the entire string, so why split it into lines?
Instead of:
f:write(bomUtf16..UTF8MultiLinetoUTF16(strDefUTF8))
why not use:
f:write(bomUtf16..StrUTF8_UTF16(strDefUTF8))
and dispense with UTF8MultiLinetoUTF16 entirely?
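A minimal sketch of why the two approaches should agree: UTF16 encoding works code point by code point, so encoding each line and concatenating gives the same bytes as encoding the whole text once, provided the line ends are the same. `Utf8ToUtf16le` is a hypothetical pure-Lua stand-in for StrUTF8_UTF16 (it assumes well-formed UTF8 input), and the two wrapper functions are named for this illustration only.

```lua
-- Stand-in encoder: well-formed UTF8 in, UTF16 little-endian bytes out (no BOM).
function Utf8ToUtf16le(text)
    local out, i = {}, 1
    while i <= #text do
        local b = text:byte(i)
        local cp, len
        if b < 0x80 then cp, len = b, 1
        elseif b < 0xE0 then cp, len = b - 0xC0, 2
        elseif b < 0xF0 then cp, len = b - 0xE0, 3
        else cp, len = b - 0xF0, 4 end
        for j = i + 1, i + len - 1 do
            cp = cp * 64 + (text:byte(j) - 0x80)   -- append 6 continuation bits
        end
        i = i + len
        if cp < 0x10000 then
            out[#out + 1] = string.char(cp % 256, math.floor(cp / 256))
        else
            cp = cp - 0x10000                      -- split into a surrogate pair
            local hi = 0xD800 + math.floor(cp / 0x400)
            local lo = 0xDC00 + cp % 0x400
            out[#out + 1] = string.char(hi % 256, math.floor(hi / 256),
                                        lo % 256, math.floor(lo / 256))
        end
    end
    return table.concat(out)
end

-- Encode line by line, appending \r\n to each line (text must end in a newline).
function EncodeLineByLine(text, encode)
    local parts = {}
    for line in text:gmatch("(.-)\r?\n") do
        parts[#parts + 1] = encode(line .. "\r\n")
    end
    return table.concat(parts)
end

-- Normalise the line ends once, then encode the whole text in one call.
function EncodeWhole(text, encode)
    return encode((text:gsub("\r?\n", "\r\n")))
end

local sample = "Name=Task\nLine1=n;Мoкв: ;\n"
print(EncodeLineByLine(sample, Utf8ToUtf16le) == EncodeWhole(sample, Utf8ToUtf16le))   -- true
```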

Post by ColeValleyGirl »

Because when I originally tried it way back at the beginning of this saga, it didn't appear to be handling the newline characters properly. I'll take another look.

Post by tatewise »

It may be that you need to ensure each newline has both \r and \n.
i.e.
f:write( bomUtf16 .. StrUTF8_UTF16( strDefUTF8:gsub("\n","\r\n") ) )

Post by ColeValleyGirl »

That's what I suspect, Mike, but is it worth going through all my code and code snippets making that change and retesting everything? If I don't do it all at once, my memory will let me down and something will break down the line... Given that just "\n" works in everything else, I might let sleeping dogs lie :|

Post by tatewise »

OK, understood, but the change I suggested only affects the final writing of the output file and nowhere else.

f:write( bomUtf16 .. StrUTF8_UTF16( strDefUTF8:gsub("\n","\r\n") ) )

or to be absolutely foolproof

f:write( bomUtf16 .. StrUTF8_UTF16( strDefUTF8:gsub("\r?\n","\r\n") ) )

so that if there already is a \r\n pairing then an extra \r won't get inserted (the optional \r in the pattern also copes with blank lines and with a \n at the very start of the text), but my tests suggest that it does not matter if there are multiple \r.
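A small sketch comparing newline-normalising patterns on a sample that contains a blank line and an existing \r\n pair. A pattern that demands a non-\r character before the \n, such as "([^\r])\n", leaves bare any \n that directly follows another \n or starts the text, while "\r?\n" rewrites every line end exactly once. Note also that gsub returns two values (the new string and the number of matches); wrapping the call in parentheses keeps only the string, which matters if the receiving function accepts a second argument.

```lua
local sample = "a\n\nb\r\nc\n"

-- Plain replacement: doubles the \r of an existing \r\n pair.
local plain = (sample:gsub("\n", "\r\n"))

-- Requires a non-\r character before \n: misses the blank line's \n.
local guarded = (sample:gsub("([^\r])\n", "%1\r\n"))

-- Optional \r before \n: every line end becomes exactly \r\n.
local optional = (sample:gsub("\r?\n", "\r\n"))

print(plain    == "a\r\n\r\nb\r\r\nc\r\n")   -- true
print(guarded  == "a\r\n\nb\r\nc\r\n")       -- true (second \n left bare)
print(optional == "a\r\n\r\nb\r\nc\r\n")     -- true

-- Without the parentheses, gsub passes on its match count as well.
print(select("#", sample:gsub("\n", "\r\n")))   -- 2
```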

Post by ColeValleyGirl »

I think I'll wrap the encoding, BOM and newline handling up in a couple of UTF16 file handling routines so that everything elsewhere in my plugins is 8-bit clean.
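A rough sketch of what such a pair of routines might look like, assuming the converters are passed in as parameters so the routines themselves stay independent of the encoder library. `WriteUtf16File` and `ReadUtf16File` are hypothetical names; in a plugin the `encode` argument would be StrUTF8_UTF16 and the `decode` argument StrUTF16_UTF8, on the assumption that each takes one string and returns one string.

```lua
local BOM_UTF16LE = string.char(0xFF, 0xFE)   -- "ÿþ" as raw bytes

-- Write UTF8 text as a UTF16 file with a BOM and \r\n line ends.
-- encode: function taking a UTF8 string and returning UTF16 bytes.
function WriteUtf16File(strPath, strUtf8, encode)
    local strCrLf = strUtf8:gsub("\r?\n", "\r\n")   -- keeps only the string result
    local f = assert(io.open(strPath, "wb"))        -- binary mode: no newline translation
    f:write(BOM_UTF16LE, encode(strCrLf))
    f:close()
end

-- Read such a file back as UTF8 text with plain \n line ends.
-- decode: function taking UTF16 bytes and returning a UTF8 string.
function ReadUtf16File(strPath, decode)
    local f = assert(io.open(strPath, "rb"))
    local strBytes = f:read("*a")
    f:close()
    if strBytes:sub(1, 2) == BOM_UTF16LE then
        strBytes = strBytes:sub(3)                  -- drop the BOM before decoding
    end
    return (decode(strBytes):gsub("\r\n", "\n"))
end
```

With these in place, the rest of the plugin can keep using "\n" throughout, and a call such as WriteUtf16File(ConfiguredFactFile, strDefUTF8, StrUTF8_UTF16) would be the only place where the BOM, the \r\n line ends and the encoding come together.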