* Problem with utf8 library

User avatar
tatewise
Megastar
Posts: 27087
Joined: 25 May 2010 11:00
Family Historian: V7
Location: Torbay, Devon, UK
Contact:

Post by tatewise » 09 Jul 2021 13:53

I don't know what I am doing wrong but some utf8 library functions do not seem to work as expected.
The sample script below does not behave as expected for the utf8.lower() and utf8.upper() functions.
They do not handle any of the non-ANSI accented UTF-8 characters.
They behave exactly the same as the standard string.lower() and string.upper() functions.
The script File > Encoding is UTF-8.

Code:

-- Load and initialise the utf8 library
utf8 = require(".utf8"):init()
local strText = "Smitħ điane eLiçabĘth Ĥélźton"
-- Expected: all lower case and all upper case, accented characters included
local strLower = utf8.lower(strText)
local strUpper = utf8.upper(strText)
fhMessageBox(strText.."\n"..strLower.."\n"..strUpper)

The same problem affects both FH v7.0.7 and FH v6.2.7 when following the advice in Lua References and Library Modules for utf8.

My current workaround is to use my string library module that is modified to handle UTF-8 and is in many of my Plugins.
However, strangely, with that library enabled the utf8.lower() and utf8.upper() functions behave correctly. Very odd!
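For reference, the symptom can be reproduced with the standard string library alone. This is a minimal sketch, assuming a stock Lua interpreter in the default "C" locale and a script saved as UTF-8 (the locale inside FH is an assumption and may differ). The byte-oriented functions change only the ASCII letters and leave every multi-byte character untouched:

```lua
-- Assumes this file is saved as UTF-8 and Lua runs in the default "C" locale.
local text = "Ĥélźton"

-- string.upper()/string.lower() work byte by byte, so only A-Z / a-z change.
local upper = string.upper(text)
local lower = string.lower(text)

print(upper)  --> ĤéLźTON  (é and ź are not converted)
print(lower)  --> Ĥélźton  (Ĥ is not converted)
```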
Mike Tate ~ researching the Tate and Scott family history ~ tatewise ancestry

User avatar
ColeValleyGirl
Megastar
Posts: 4854
Joined: 28 Dec 2005 22:02
Family Historian: V7
Location: Cirencester, Gloucestershire
Contact:

Post by ColeValleyGirl » 09 Jul 2021 14:13

I think the problem may be in the FH debugger and fhMessageBox().

Try

Code:

utf8 = require(".utf8"):init()
local strText = "Smitħ điane eLiçabĘth Ĥélźton"
local strLower = utf8.lower(strText)
local strUpper = utf8.upper(strText)
require "fhutils"
iup.Message("Test",strText.."\n"..strLower.."\n"..strUpper)

Post by tatewise » 09 Jul 2021 14:34

fhutils makes no difference, and the script also fails when run normally, outside the debugger.
It also fails in FH v6.2.7, where fhutils is not available.
Please try the script yourself.

Post by ColeValleyGirl » 09 Jul 2021 14:44

My script runs to completion as shown, both in the debugger and when run outside it (FH v7.0.7).

For FH6 replace

Code:

require "fhutils"

with

Code:

require "iuplua"

(this also works in FH7; I was just being lazy with fhutils) and include compat53 as per the Lua references documentation in the KB. I haven't tested on FH6 but wouldn't expect a problem.
Attachment: Screenshot 2021-07-09 154248.png

Post by tatewise » 09 Jul 2021 15:06

Maybe I did not make clear what I meant.
The utf8.lower() and utf8.upper() functions should convert the original text into all lower case and all upper case UTF-8 versions similar to the string.lower() and string.upper() functions for purely ANSI strings.
e.g.
strText = Smitħ điane eLiçabĘth Ĥélźton
strLower = smitħ điane eliçabęth ĥélźton
strUpper = SMITĦ ĐIANE ELIÇABĘTH ĤÉLŹTON

If I use my string library the fhMessageBox() displays the above text correctly.

Both the debugger and fhMessageBox() show that the UTF-8 characters are mishandled by utf8.lower() and utf8.upper().
What they actually do is use the string.lower() and string.upper() functions unchanged from the string library.

Post by ColeValleyGirl » 09 Jul 2021 15:50

The code within the library says

Code:

-- If utf8data.lua (containing the lower<->upper case mappings) is loaded, these
-- additional functions are available:
-- * utf8upper(s)
-- * utf8lower(s)

However, utf8data isn't part of this library and, having had a quick look for it, it seems to be just a big two-way lookup table that does not include every possible UTF-8 character.

If utf8data isn't loaded, the utf8.lower and utf8.upper functions aren't defined, and I guess your library then does so?

Post by tatewise » 09 Jul 2021 16:47

"If utf8data isn't loaded, the utf8.lower and utf8.upper functions aren't defined."
The utf8.lower and utf8.upper functions are always defined, i.e. those functions always exist in the utf8 table.
See the script in my OP. That script is all that exists in my test scenario. My library is not involved.

The problem is that without utf8data.lua they are silently mapped onto the string.lower and string.upper functions.
So plugin authors need to be aware of this limitation of the utf8 library.
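Because the fallback is silent, a plugin can only find out by testing the behaviour. The following is a minimal sketch of such a probe; `handles_utf8_case` is a hypothetical helper, not part of the utf8 library or of FH, and the result shown for the standard functions assumes the default "C" locale:

```lua
-- Hypothetical helper (not part of any library): returns true when the given
-- lower/upper pair converts an accented UTF-8 character, false otherwise.
local function handles_utf8_case(lower_fn, upper_fn)
  return lower_fn("É") == "é" and upper_fn("é") == "É"
end

-- The standard string functions fail the probe in the default "C" locale.
print(handles_utf8_case(string.lower, string.upper))

-- In a plugin, the same probe could be applied to the utf8 library, e.g.
--   handles_utf8_case(utf8.lower, utf8.upper)
```

A plugin could run the probe once at start-up and fall back to its own conversion table when it returns false.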

The utf8 table has 26 functions as listed below. I wonder which ones actually perform a useful utf8 function?
The online documentation does not provide many clues. I know that the Lua 5.3 utf8 functions are useful.
utf8 => (table .32)
validator => (function)
gensub => (function)
require => (function)
next => (function)
dump => (function)
gmatch => (function)
sub => (function)
init => (function)
codes => (function)
rep => (function)
lower => (function)
find => (function)
unicode => (function)
len => (function)
len53 => (function)
format => (function)
codepoint => (function)
offset => (function)
debug => (function)
byte => (function)
reverse => (function)
upper => (function)
char => (function)
match => (function)
validate => (function)
gsub => (function)

Post by ColeValleyGirl » 09 Jul 2021 17:17

My reference to your library was based on this:
"However, strangely, with that library enabled the utf8.lower() and utf8.upper() functions behave correctly. Very odd"

Re useful functions, the README says:
"This library can be used as drop-in replacement for vanilla string library. It exports all vanilla functions under raw sub-object.

[snip]

It also provides all functions from Lua 5.3 UTF-8 module except utf8.len (s [, i [, j]]). If you need to validate your strings use utf8.validate(str, byte_pos) or iterate over with utf8.validator."

So the useful items are the functions equivalent to the string library and the Lua utf8 library, would be my reading. Plus, the utf8 lower and upper functions will not work without the relevant lookup table (without which they default to the vanilla string.upper and string.lower functions). It would help if that was covered in the readme, I agree.

Relevant: https://github.com/Stepets/utf8.lua/issues/7. I suggest you follow up on github if you want to get things changed.

Post by tatewise » 09 Jul 2021 17:23

So perhaps there needs to be a note in Lua References and Library Modules against the utf8 library mentioning that utf8.lower and utf8.upper are the same as string.lower and string.upper?

My stringx library overloads the string library and gave me the impression that utf8.lower and utf8.upper were working, but when I removed my stringx library they stopped working and confused me completely. My stringx constructs a case translation table that is presumably similar to utf8data.lua and currently only covers the most popular European alphabets but is easily extended.
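A case translation table of that kind can be applied with a single gsub() per conversion. The sketch below is illustrative only and is not the stringx code: the seven-entry table and the names `lc_uc`, `uc_lc`, `utf8_upper` and `utf8_lower` are assumptions chosen to cover the test string from the first post.

```lua
-- Illustrative mapping only: a real table would cover far more characters.
local lc_uc = {
  ["ħ"] = "Ħ", ["đ"] = "Đ", ["ç"] = "Ç", ["ę"] = "Ę",
  ["ĥ"] = "Ĥ", ["é"] = "É", ["ź"] = "Ź",
}

-- Build the reverse (upper -> lower) table from the forward one.
local uc_lc = {}
for lower, upper in pairs(lc_uc) do
  uc_lc[upper] = lower
end

-- Matches one UTF-8 encoded character (lead byte plus continuation bytes).
local CHAR_PATTERN = "[%z\1-\127\194-\244][\128-\191]*"

-- Convert ASCII with the standard function, then map each character through
-- the table. Characters missing from the table are left unchanged.
local function utf8_upper(s)
  return (string.upper(s):gsub(CHAR_PATTERN, lc_uc))
end

local function utf8_lower(s)
  return (string.lower(s):gsub(CHAR_PATTERN, uc_lc))
end

local text = "Smitħ điane eLiçabĘth Ĥélźton"
print(utf8_lower(text))  --> smitħ điane eliçabęth ĥélźton
print(utf8_upper(text))  --> SMITĦ ĐIANE ELIÇABĘTH ĤÉLŹTON
```

Extending coverage is then a matter of adding entries to the forward table; the reverse table follows automatically.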

Post by ColeValleyGirl » 09 Jul 2021 17:43

Maybe make the relevant parts of your string library available as a snippet, with instructions to use it alongside the utf8 library? Combined with a warning in the Lua References and Library Modules to use the snippet if lower and upper are required?

Post by tatewise » 09 Jul 2021 18:00

It is already there in Unicode String Functions derived from the old KB, but has been commented to say it is superseded by the utf8 library, which we now know is not the case for the lower and upper functions.

If not required, the length and substring and caseless functions can simply be deleted.

Post by ColeValleyGirl » 10 Jul 2021 12:10

Mike, email me the exact code you suggest retaining and the accompanying text -- otherwise I may get it wrong (I have a lot of health stuff on my plate right now).

My preference would be for something that provides either utf8.lower and utf8.upper or a utf8data.lua module, to make things simpler for people using it; as it stands, the snippet provides string.lower and string.upper.

Once we know what the snippet is going to provide we can agree on wording for the References and Libraries article.

Post by tatewise » 10 Jul 2021 13:00

I'd prefer my snippet to continue to provide string.lower and string.upper and for that matter string.length, etc, as it does now.

The advantage is that the snippet can be used on its own when only the lower, upper & length functions are needed.
That way any existing script that uses string.lower &/or string.upper automatically adapts to handling UTF8 characters without change and without needing the utf8 library. It also remains compatible with FH v5.
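The reason existing scripts adapt without change is that method calls such as s:lower() are looked up in the string table at call time. A minimal sketch of the idea, assuming a tiny illustrative table (`uc_lc` and `raw_lower` are hypothetical names, not taken from the snippet):

```lua
-- Keep a reference to the original byte-oriented function.
local raw_lower = string.lower

-- Tiny illustrative mapping; a real snippet would carry a full table.
local uc_lc = { ["É"] = "é", ["Ĥ"] = "ĥ", ["Ź"] = "ź" }

-- Replace string.lower with a UTF-8 aware version. Method calls such as
-- s:lower() resolve through the string table, so they pick this up too.
function string.lower(s)
  -- Lower the ASCII letters first, then map each multi-byte character.
  return (raw_lower(s):gsub("[\194-\244][\128-\191]*", uc_lc))
end

local name = "ĤÉLŹTON"
print(name:lower())  --> ĥélźton
```

The trade-off is that the override is global: every script and module sharing the Lua state sees the new behaviour.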

When the utf8 library is installed and utf8data.lua is missing, it automatically maps the utf8.lower and utf8.upper functions onto the string.lower and string.upper functions, so they both perform the same UTF8 conversions.
Therefore, my snippet does not need to explicitly provide utf8.lower and utf8.upper functions.

If we can agree on that strategy then I can supply the details including the above explanation.

Alternatively, I have found various copies of utf8data.lua. Ignoring licensing issues for the moment, I have tried copying it into C:\ProgramData\Calico Pie\Family Historian\Plugins and C:\Program Files (x86)\Family Historian\Program\Lua, and have also tried require("utf8data"), which loads the translation tables but has no beneficial effect on the utf8 library.
Have you any ideas where utf8data.lua should be loaded so that the utf8 library detects its presence?
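One way to check whether require() can find the file at all is to load it defensively and print the search path. This sketch only shows whether the module loads, not whether the utf8 library makes use of it; the global table names `utf8_lc_uc` and `utf8_uc_lc` are an assumption about what utf8data.lua defines:

```lua
-- pcall() stops a missing module from aborting the script.
local loaded = pcall(require, "utf8data")

-- Assumption: utf8data.lua defines these two global tables when it loads.
local have_tables = loaded
  and type(utf8_lc_uc) == "table"
  and type(utf8_uc_lc) == "table"

if have_tables then
  print("utf8data tables are available")
else
  -- package.path lists the locations require() searches for Lua modules.
  print("utf8data tables not found; searched:\n" .. package.path)
end
```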

Otherwise, it could easily be incorporated into the string.lower and string.upper snippet and provide conversion for 930 Unicode characters instead of the 120 that my code offers.

Post by ColeValleyGirl » 11 Jul 2021 10:48

Mike, I think our minds are working on similar lines.

I would like to see a solution that:
  • works alongside the utf8 library in FH5, 6 and 7 for people who need utf8 functionality *and* upper and lower (I believe compat53/utf8 should work in fh5 -- there's no reason why not -- but have not tested it and won't have time to do so soon, hence not documenting it in the kb)
  • works standalone for people who only need the lower/upper/length functions.
I haven't experimented with utf8data.lua, sorry. However, I believe all the versions are open-source, so the 930 characters rather than 120 would be good, if you have time to do it :D

If you can supply the new details for the snippet article, and suggested wording changes for the library and reference article, I'd be very grateful. Email please?

Post by tatewise » 11 Jul 2021 16:40

OK, I'll work on that solution but it won't be for a few days.

Post by tatewise » 12 Jul 2021 15:01

In the meantime, I've discovered why the FH utf8 library does not detect utf8data.lua being loaded.
The utf8data.lua file has tables utf8_lc_uc and utf8_uc_lc to translate characters.

The file C:\Program Files (x86)\Family Historian\Program\Lua\utf8\primitives\dummy.lua contains the id & comments:

Code:

-- $Id: utf8.lua 179 2009-04-03 18:10:03Z pasta $

-- If utf8data.lua (containing the lower<->upper case mappings) is loaded, these
-- additional functions are available:
-- * utf8upper(s)
-- * utf8lower(s)

The file https://github.com/subsoap/defsave/blob ... e/utf8.lua has identical id & comments but also the commented out script:

Code:

--[[
-- replace UTF-8 characters based on a mapping table
local function utf8replace (s, mapping)
: : : : : : :
return newstr
end

-- identical to string.upper except it knows about unicode simple case conversions
local function utf8upper (s)
	return utf8replace(s, utf8_lc_uc)
end
-- identical to string.lower except it knows about unicode simple case conversions
local function utf8lower (s)
	return utf8replace(s, utf8_uc_lc)
end
]]

The file https://github.com/vhallac/ouf_grid/blo ... 8/utf8.lua has an earlier id 147 and script:

Code:

-- $Id: utf8.lua 147 2007-01-04 00:57:00Z pasta $

-- identical to string.upper except it knows about unicode simple case conversions
local function utf8upper (s)
	return utf8replace(s, utf8_lc_uc)
end

-- install in the string library
if not string.utf8upper and utf8_lc_uc then
	string.utf8upper = utf8upper
end

-- identical to string.lower except it knows about unicode simple case conversions
local function utf8lower (s)
	return utf8replace(s, utf8_uc_lc)
end

-- install in the string library
if not string.utf8lower and utf8_uc_lc then
	string.utf8lower = utf8lower
end

The FH utf8 library files do not contain any references to utf8upper, utf8_lc_uc, utf8lower, or utf8_uc_lc.
So even if require("utf8data") is used, the library has nothing to detect or utilise its tables.

Therefore, one solution is for CP to incorporate the above script to honour their dummy.lua comment:

Code:

-- If utf8data.lua (containing the lower<->upper case mappings) is loaded, these
-- additional functions are available:
-- * utf8upper(s)
-- * utf8lower(s)

Do you think that solution is worth pursuing?

Post by ColeValleyGirl » 12 Jul 2021 15:18

Frankly, Mike, no. Calico Pie are not responsible for the library and the forks you're looking at aren't the one that I packaged. You could if you felt strongly (as I've already suggested) raise an issue on https://github.com/Stepets/utf8.lua -- Stepan seems quite responsive.

Post by tatewise » 12 Jul 2021 17:06

Sorry, but I didn't appreciate the significance of your earlier link and its relevance to the utf8 version packaged in FH by CP.
Furthermore, I have only just discovered the earlier utf8 versions that explain how utf8data.lua used to be detected.
I have posted an issue at "lower() and upper() support for utf8data.lua #13" and await a reply.

Post by ColeValleyGirl » 12 Jul 2021 17:12

Mike, just to correct a point of fact: I did the packaging of utf8 using the code from GitHub at https://github.com/Stepets/utf8.lua; Calico Pie agreed to include it in their distribution, but were not otherwise involved.

Post by tatewise » 13 Jul 2021 10:19

Stepan has responded and has provided an updated utf8 version at https://github.com/Stepets/utf8.lua/tre ... r_utf8data.

The README.md says at the end of Configuration:
"For lower and upper functions to work in environments where ffi cannot be used, you can specify substitution tables (data example)"

Code:

local utf8 = require('.utf8')
utf8.config = {
  conversion = {
    uc_lc = utf8_uc_lc,
    lc_uc = utf8_lc_uc
  },
}
utf8:init()

The data example link refers to utf8data.lua, which can be included in the FH installation ...\Program\Lua\ folder.
Then using require("utf8data") prior to the above code provides full utf8 lower and upper conversions.
e.g.

Code:

require("utf8data")
local utf8 = require(".utf8")
utf8.config = { conversion = { uc_lc = utf8_uc_lc; lc_uc = utf8_lc_uc; } }
utf8:init()

Helen, are you able to package that utf8 library version with utf8data.lua in a future update of FH v7.0?

Then there will be no need for a string library snippet except as a standalone feature as exists now.
To adjust the string library functions so they are replaced by the utf8 library functions simply needs:

Code:

for k,v in pairs(utf8) do
  string[k] = v
end

Then, for example, len = s:len() and s = s:lower() and s = s:upper() would handle UTF-8 character strings. Existing plugin scripts would need no changes apart from the above configuration code, and would be v5, v6 & v7 compatible in conjunction with the compat53 library.
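The effect of that loop can be tried without the utf8 library by using a stand-in table. In this sketch `my_utf8` is hypothetical and provides only a len function; the real library would supply many more:

```lua
-- Stand-in for the utf8 library (hypothetical; only len is sketched here).
local my_utf8 = {}

-- Count characters by counting every byte that is not a continuation byte.
function my_utf8.len(s)
  local _, count = s:gsub("[^\128-\191]", "")
  return count
end

local byte_len = string.len   -- keep the original for comparison

-- Copy each function over its string library namesake, as in the loop above.
for name, fn in pairs(my_utf8) do
  string[name] = fn
end

local s = "Ĥélźton"
print(byte_len(s), s:len())  --> 10      7
```

Note that the length operator #s is not affected by the override and still returns the byte count.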

Post by ColeValleyGirl » 13 Jul 2021 15:34

I'll get back to you, Mike.

Post by tatewise » 13 Jul 2021 17:01

There is one small improvement I've suggested at the end of my posting at "lower() and upper() support for utf8data.lua #13".

P.S. Stepan has now implemented that suggestion. Impressive!

Post by tatewise » 14 Jul 2021 14:53

There is a problem with utf8.charpattern in FH v6 Lua 5.1 with compat53, which I have added to the end of my posting at "lower() and upper() support for utf8data.lua #13".

Post by tatewise » 15 Jul 2021 11:39

Stepan has fixed the problem by using utf8.charpattern = "[%z\1-\127\194-\244][\128-\191]*" unconditionally.
This works because %z although deprecated is still supported in FH v7 Lua 5.3 and Lua 5.4 too.
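A short check of both claims, written to run unchanged on Lua 5.1 and later. It is a sketch for experimenting, not part of the library; the variable names are arbitrary:

```lua
-- %z matches a NUL byte in the subject string.
local nul_pos = ("a\0b"):find("%z")
print(nul_pos)  --> 2

-- The pattern quoted above, matching one UTF-8 character at a time.
local charpattern = "[%z\1-\127\194-\244][\128-\191]*"

-- Split a string containing a NUL and two 2-byte characters.
local chars = {}
for ch in ("Ĥ\0é"):gmatch(charpattern) do
  chars[#chars + 1] = ch
end
print(#chars)  --> 3
```

Writing the NUL as %z keeps the pattern string itself free of embedded zero bytes, which Lua 5.1 patterns cannot contain.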

So in Writing and Maintaining Plugins Compatible with Versions 5, 6 & 7 under Plugins that Originated in Version 5 or 6 we should change the advice for %z and \0 to say continue using %z and avoid \0 to maintain v5, v6 & v7 compatibility.
It also says "%z was deprecated in Lua 5.2 and removed in Lua 5.3" which is a mistake as it was not removed.

The utf8 library, the utf8data.lua mapping file, and a new utf8lua.lua initialisation file could all be packaged together, where utf8lua.lua contains:

Code:

-- utf8lua.lua initialisation file
utf8 = require(".utf8")
require("utf8data")	-- As per utf8 library README.md Configuration: for lower and upper functions
utf8.config = { conversion = { uc_lc = utf8_uc_lc; lc_uc = utf8_lc_uc; } }
utf8:init()

Then the KB Lua References and Library Modules advice for using compat53 and utf8 would suggest the simpler script:

Code:

if fhGetAppVersion() <= 6 then
	loadrequire("utf8")
	loadrequire("compat53")
end
require("utf8lua")

However, advanced users could do their own thing with the utf8 library and utf8data.lua file if they wish.

Post by ColeValleyGirl » 16 Jul 2021 09:22

Using %z feels like kicking the can down the road, but at some point we'll stop worrying about Lua 5.1 compatibility and can deal with the can then. See what you think of my new wording for Writing and Maintaining Plugins Compatible with Versions 5, 6 & 7.

When I get the word that Stepan has updated the master version of the library, I'll package it up with your suggested additions and pass it along to CP.
