Checking string is ASCII

Posted: 05 Apr 2022 08:54
by Mark1834
I need to check that a string contains only pure ASCII characters to ensure full compatibility across a number of different systems. There doesn't seem to be an existing function to do this, so the best option I've come up with is a two-step process:
  1. Use fhConvertUTF8toANSI to convert to ANSI, and if fhIsConversionLossFlagSet() returns true, it is a UTF string.
  2. Check that the converted string doesn't contain byte values above 127.
Any better ideas, please?

Re: Checking string is ASCII

Posted: 05 Apr 2022 09:18
by ColeValleyGirl
"Determine the Character Encoding of a File" tests for ANSI or UTF. Can you be sure your string includes a BOM if it's Unicode?

You'll need another step for ASCII though... I assume you're planning to use pattern matching to find any non-ASCII characters?
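For reference, a minimal sketch of that pattern-matching step in plain Lua. The function name isPureAscii is made up for illustration, and it assumes the string is held as ANSI or UTF-8 bytes (a UTF-16 string would need converting first, since its zero bytes would slip through):

```lua
-- Returns true if every byte of str is in the 7-bit ASCII range (0-127).
-- Hypothetical helper, not part of the FH API.
local function isPureAscii(str)
  -- Any byte from 128 to 255 means the string is not pure ASCII, whether it
  -- came from an ANSI code page (single high bytes) or UTF-8 (multi-byte).
  return str:find("[\128-\255]") == nil
end

print(isPureAscii("C:\\Backup\\FH"))        -- ASCII-only path
print(isPureAscii("C:\\Backup\\caf\195\169")) -- ends in UTF-8 encoded e-acute
```

Because it works on raw bytes, the same check should behave identically whether the plugin runs in ANSI or UTF-8 mode, so the conversion step may not be needed just to answer "is it ASCII?".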

Re: Checking string is ASCII

Posted: 05 Apr 2022 09:51
by Mark1834
Not just that - I'm looking at Windows batch files/command scripts and how they might be processed by a simple plugin. The scripts are very powerful and process files in any language in any locale, but the encoding of the script itself is turning out to be rather more complicated than I imagined. Windows doesn't run batch files encoded with a BOM. Neither the new FH7 text file functions nor fhFileUtils can create UTF text files without a BOM. The new FH7 functions for reading and writing ini files only work in pure ASCII (reported to CP). IMO, it's a right old bu**er's muddle :?.

The simplest solution is to keep everything in plain ASCII. Windows is happy, and it works in FH5/6/7. The file handling is simple enough to just use the plain lua options. I could create more complex solutions, but it could quickly get a lot more complex for little actual gain in functionality. IMO that is an acceptable trade-off, particularly at this early stage of development.

Re: Checking string is ASCII

Posted: 05 Apr 2022 10:07
by ColeValleyGirl
Mark1834 wrote:
05 Apr 2022 09:51
The new FH7 functions for reading and writing ini files only work in pure ASCII (reported to CP).
In my experience with other programmes, that's done deliberately so that ini files are independent of locale. This simplicity is for the same reason (taken from the Help File):
•All 'constant strings' in the Family Historian API are always ASCII and are consequently identical in ANSI and UTF-8. For example, suppose you want to call fhGetContextInfo to get the name of the current plugin. To do this, you need to pass it the text string 'CI_PLUGIN_NAME' - a constant string. This string consists of alphabetic characters and underscores, all of which are ASCII characters, so it is both a valid ANSI string and a valid UTF-8 string. So there is no need to write fhGetContextInfo(fhConvertANSItoUTF8('CI_PLUGIN_NAME')), even in a module (see next section). It will work, but it's unnecessary. You instead write fhGetContextInfo('CI_PLUGIN_NAME'), and this will work, regardless of the current string encoding.

•All data references and tags in the Family Historian API are also guaranteed to only use ASCII characters. So, again, you don't need to worry about whether these are ANSI or UTF-8 as they will always be valid strings in either encoding.

Re: Checking string is ASCII

Posted: 05 Apr 2022 10:21
by tatewise
Mark, I am with you all the way.
I have been banging on about such ASCII, ANSI, and Unicode issues for years, and as you say it is still a muddle.
Handling such character encodings in strings and filepaths is a nightmare for any but the simplest operations.

If it is of any help, some of my plugins run batch scripts, especially in Backup and Restore FH Settings.
There is a knack to getting them to run 'silently' and 'invisibly' but also 'visibly' for debugging.
The latest fhSaveTextFile(...) function supports "ANSI" encoding and therefore ASCII text without a BOM.
Also, assuming your batch script file has a 'safe' ANSI filepath, the Lua io.open, write & close functions will save any encoded text without a BOM.
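A rough sketch of that plain Lua open/write/close route. The helper names writeRaw and readRaw are made up for illustration, and it assumes the path itself is one that io.open can handle (the 'safe' ANSI filepath mentioned above):

```lua
-- Writes text to path byte-for-byte, so no BOM is ever added.
-- Returns true on success, or nil plus an error message.
local function writeRaw(path, text)
  local file, err = io.open(path, "wb")  -- binary mode: no newline translation
  if not file then return nil, err end
  file:write(text)
  file:close()
  return true
end

-- Reads the whole file back as a byte string, or returns nil plus an error.
local function readRaw(path)
  local file, err = io.open(path, "rb")
  if not file then return nil, err end
  local text = file:read("*a")
  file:close()
  return text
end
```

Binary mode is used so that the CRLF line endings in the script text are written exactly as supplied rather than being translated a second time on Windows.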

Re: Checking string is ASCII

Posted: 05 Apr 2022 11:56
by Mark1834
All the stuff in %APPDATA% and %PROGRAMDATA% is ok, as Windows takes care of their location in non-English versions, but it’s a pity if FH ini files are designed deliberately so I can’t save something like “Path=(non-ASCII local user file)”.

My executive decision on this morning’s canal-side dog walk is to keep things simple by restricting the user-defined save path to ASCII only. I can then use plain Lua for reading and writing the configuration file and exporting the script itself. It will work equally in FH6 and 7 with just a small increase in complexity, and I can concentrate on the core functionality rather than getting bogged down in tedious encoding issues.

Re: Checking string is ASCII

Posted: 05 Apr 2022 12:20
by tatewise
FYI: My plugins often use the C:\ProgramData\Calico Pie\Family Historian\Plugin Data\ folder with a safe filename for any temporary scripts, etc.

Re: Checking string is ASCII

Posted: 05 Apr 2022 12:41
by ColeValleyGirl
Mark, not tested but I've seen suggestions that if you create the ini file as an empty UTF16 file just containing a BOM, it will then accept UTF16 characters. See http://archives.miloush.net/michkap/arc ... 54992.html.

This is of course assuming that CP are using the relevant Windows API to implement the function.

Edited to add: Tested and it works.

Code:

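fhfu = require('fhFileUtils')  -- loaded in the original test; not used below
strFilePath = "D:\\OneDrive\\Desktop\\myini.ini"
-- Create an empty UTF-16LE file (BOM only) before the ini functions touch it
bOK = fhSaveTextFile(strFilePath, "", "UTF-16LE")
-- Write, then read back, a value with a non-ASCII key and non-ASCII text
bOK = fhSetIniFileValue(strFilePath, "Main", "цу", "text", "цонцлусионемяуе цу")
var = fhGetIniFileValue(strFilePath, "Main", "цу", "text", "broken")
print(var)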

Re: Checking string is ASCII

Posted: 05 Apr 2022 13:32
by Mark1834
Curious... CP have responded with a “logged for developers” rather than “this is by design”, so I’ll pass on your observation. It does suggest that something is not quite right. I think I’ll adopt the standard ini file structure even if I construct it in vanilla Lua in order to keep it forwards compatible.
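As an illustration of the vanilla Lua approach, here is a minimal sketch of reading and writing the standard [Section] / key=value layout. The names parseIni and formatIni are hypothetical, all values are treated as plain strings, and this is not how the FH7 ini functions are implemented:

```lua
-- Parses ini-style text into a table: result[section][key] = value (strings).
local function parseIni(text)
  local data, section = {}, nil
  for line in text:gmatch("[^\r\n]+") do
    local name = line:match("^%s*%[([^%]]+)%]%s*$")
    if name then
      section = name
      data[section] = data[section] or {}
    elseif section and not line:match("^%s*;") then  -- skip ';' comment lines
      local key, value = line:match("^%s*([^=]-)%s*=%s*(.-)%s*$")
      if key and key ~= "" then data[section][key] = value end
    end
  end
  return data
end

-- Serialises the table back to ini text with CRLF line endings.
-- Sections and keys are sorted so the output is stable between runs.
local function formatIni(data)
  local out, sections = {}, {}
  for name in pairs(data) do sections[#sections + 1] = name end
  table.sort(sections)
  for _, name in ipairs(sections) do
    out[#out + 1] = "[" .. name .. "]"
    local keys = {}
    for key in pairs(data[name]) do keys[#keys + 1] = key end
    table.sort(keys)
    for _, key in ipairs(keys) do
      out[#out + 1] = key .. "=" .. tostring(data[name][key])
    end
  end
  return table.concat(out, "\r\n") .. "\r\n"
end
```

Keeping to this layout means the file should remain readable by the FH7 ini functions later, subject to whatever encoding rules they turn out to apply.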

Re: Checking string is ASCII

Posted: 05 Apr 2022 13:41
by ColeValleyGirl
It may not be by design, Mark. The documentation of the Windows ini file API is pretty poor, as that reference I gave you suggests, so CP may not even be aware of the way it works.

Re: Checking string is ASCII

Posted: 05 Apr 2022 13:59
by Mark1834
Is it possible to save UTF key values in the Registry, or does that also run into encoding issues with LuaCOM?

Re: Checking string is ASCII

Posted: 05 Apr 2022 14:28
by Jane
Logged for Developers means it's been passed to a developer to check, it does not mean it has been identified as a fault.

If you want to put your plugin in the plugin store you need to avoid writing to the registry.

Re: Checking string is ASCII

Posted: 05 Apr 2022 14:38
by tatewise
I would save them in the C:\ProgramData\Calico Pie\Family Historian\Plugin Data\ folder.
See FHUG Snippet Preserve User Settings.

Re: Checking string is ASCII

Posted: 05 Apr 2022 14:40
by Mark1834
Thanks Jane. If even creating new Registry keys under the FH global and user keys is regarded as unacceptable, is it worth adding that to the guidance for authors?

I agree that saving in ProgramData is easier, but Microsoft can't seem to make up their minds about whether the Registry or ProgramData/AppData is the preferred option...

Re: Checking string is ASCII

Posted: 05 Apr 2022 15:16
by ColeValleyGirl
I'm missing something...

Why not just create an empty UTF16 Ini file wherever you want to put it using

Code:

fhSaveTextFile(strFilePath,"", "UTF-16LE")
and then access it via fhGetIniFileValue and fhSetIniFileValue?

Re: Checking string is ASCII

Posted: 05 Apr 2022 17:59
by Mark1834
Simple - it would make the plugin FH7 only, and I don’t want to do that unless absolutely necessary. The key issue is Windows batch file encoding rather than handling ini files, and I’ve got a couple of ideas I want to test...

Re: Checking string is ASCII

Posted: 06 Apr 2022 08:36
by Mark1834
Sorted - a number of issues were getting intertwined here, so to summarise...
  1. Windows commands entered at the command line are fully UTF-compliant.
  2. A batch file/command script can only contain ASCII characters. Even simple "ANSI", such as an accented Latin character, does not get recognised correctly. I'm not suggesting that is an absolute restriction, but applies to my UK "out of the box" configuration, and I don't want to go messing with Windows settings to change it.
  3. A batch file/command script must be encoded without a BOM. Even if the script contains only ASCII characters, it will not be read correctly if saved as UTF-8 BOM. Windows reads the 3-byte BOM as the start of a command. A workaround is to leave the first line blank, but that is not very elegant.
  4. Therefore, if I am processing commands of the type robocopy source target /options, both the source and target need to be ASCII only. Source is no problem, as that is either %PROGRAMDATA% or %APPDATA%, but the target path has to be ASCII as well.
  5. I wasn't sufficiently clear in my mind between file path and file content. Plain Lua can read and write UTF strings to file perfectly happily - they are just byte sequences and the file routines don't care how they are interpreted. If you write a UTF-8 string to file with plain Lua, the file is encoded in UTF-8 with no BOM. This appears to be the preferred format for UTF-8 anyway from what I have read, so I don't know why CP use the BOM. However, it is what it is, and we just have to deal with it that way.
  6. Most plugins process FH data and manage files in arbitrary locations. These need to be fully UTF-compliant. For a new plugin, it's a no-brainer to go down the FH7 route, as force-fitting better UTF-compliance into FH5/6 for a small and declining market is not usually worth the complexity, except for popular plugins that are still widely used. This case is different, as I don't actually need the extra FH7 tools. Checking the ASCII Plugin Data folder is one line in lfs and one line in fhFileUtils. Reading and writing the batch file is either three lines in plain Lua (open, read or write, close) or a single fhLoad/SaveTextFile(File,Contents,'ANSI') command. Reading and writing the ini configuration file is slightly easier using the new FH7 function, but the difference is marginal in this simple case. Killing backwards-compatibility to save a handful of lines would not be the right call here (IMO, of course ;)).
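To illustrate points 3 and 5, a small sketch of detecting and removing a UTF-8 BOM from text that has already been read in as a byte string. The helper names are made up for illustration:

```lua
-- The UTF-8 byte order mark is the three bytes EF BB BF.
local UTF8_BOM = "\239\187\191"

-- True if the text starts with a UTF-8 BOM.
local function hasUtf8Bom(text)
  return text:sub(1, 3) == UTF8_BOM
end

-- Returns the text with a leading UTF-8 BOM removed, or unchanged if none.
local function stripUtf8Bom(text)
  if hasUtf8Bom(text) then
    return text:sub(4)
  end
  return text
end
```

A script loaded from a file saved with a BOM could be passed through stripUtf8Bom before being written back out with plain Lua, which avoids the blank-first-line workaround.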

Re: Checking string is ASCII

Posted: 06 Apr 2022 09:42
by tatewise
Regarding point 2. a CMD batch script can contain full UTF-8 characters if you change its Code Page.
Start the .bat file with the CHange Code Page command for page 65001 which is UTF-8:
CHCP 65001

That works on any PC and temporarily allows the CMD application to use full UTF-8 characters during that one script.

So in point 4. neither the source nor target paths are restricted to ASCII.
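A hedged sketch of what generating such a script from Lua might look like. The function name buildBackupScript and the /E option are illustrative assumptions rather than anything from a real plugin, and whether CHCP 65001 behaves the same on every Windows version is questioned in the replies that follow:

```lua
-- Builds the text of a batch script that copies source to target with robocopy.
-- Hypothetical helper; /E (copy subfolders) is only an example option.
local function buildBackupScript(source, target)
  local lines = {
    "@echo off",
    "chcp 65001 >nul",  -- switch this console session to the UTF-8 code page
    'robocopy "' .. source .. '" "' .. target .. '" /E',
  }
  -- Batch files conventionally use CRLF line endings.
  return table.concat(lines, "\r\n") .. "\r\n"
end

print(buildBackupScript("C:\\ProgramData\\Calico Pie", "D:\\Backup"))
```

The resulting text would then need to be saved as UTF-8 without a BOM, for example with plain Lua file functions opened in binary mode.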

Re: Checking string is ASCII

Posted: 06 Apr 2022 09:55
by ColeValleyGirl
tatewise wrote:
06 Apr 2022 09:42
Regarding point 2. a CMD batch script can contain full UTF-8 characters if you change its Code Page.
Start the .bat file with the CHange Code Page command for page 65001 which is UTF-8:
CHCP 65001

That works on any PC and temporarily allows the CMD application to use full UTF-8 characters during that one script.

So in point 4. neither the source nor target paths are restricted to ASCII.
It only works unvarnished on PCs running Windows Version 1903 (May 2019 Update) and above; earlier builds need extra code, and I'm not sure how you'd do that in a BAT file.

It also affects everything going on on the PC at the same time, which IMO is a no-no (for the same reasons as 'temporary' registry hacks are a no-no.)

Re: Checking string is ASCII

Posted: 06 Apr 2022 09:59
by Mark1834
Agree - plugins should not mess with settings outside FH. Full stop.

Re: Checking string is ASCII

Posted: 06 Apr 2022 10:29
by tatewise
Are we discussing the same thing?
I am not talking about the CHCP.exe application.
I am talking about the CMD console CHCP 65001 batch command that has been available for decades.
As far as I am aware it only affects that one CMD console script temporarily and not the wider locale settings.

I have paused a CMD script running with CHCP 65001 and run another CMD Prompt and a Powershell that both still have code page 850 active.

Re: Checking string is ASCII

Posted: 06 Apr 2022 10:58
by ColeValleyGirl

Re: Checking string is ASCII

Posted: 06 Apr 2022 11:20
by tatewise
Helen, did you mean to post that same SO link twice? Yes, I've seen that SO thread and several others.

The top comment says:
(The highest-voted cautionary comments are 8 years old though, I doubt that they still apply.) – Tomalak Jul 21, 2019
and that was nearly 3 years ago.

Most of the subsequent advice is talking about changing system locale or startup settings.
IMO that is not what CHCP does.
As I understand it, CHCP 65001 in a CMD Batch script only affects that script and no other processes.

Anyway, if users are running Windows 10 versions earlier than 1901 then they can expect problems.

Re: Checking string is ASCII

Posted: 06 Apr 2022 11:39
by ColeValleyGirl
tatewise wrote:
06 Apr 2022 11:20
Helen, did you mean to post that same SO link twice? Yes, I've seen that SO thread and several others.
This is the other link I meant to share: https://stackoverflow.com/q/388490/1943174 which has more recent content and which Mark may find helpful, especially the comment about things depending on what compiler is used. (I have a reference somewhere that says it only works for programmes compiled using Microsoft's compiler and not for example with MinGW, but can't find it right now.)
Anyway, if users are running Windows 10 versions earlier than 1901 then they can expect problems.
So, users running Windows 8 or 7 are SOL?

Re: Checking string is ASCII

Posted: 06 Apr 2022 11:55
by Mark1834
Thanks, I’ll follow up the links later for interest and my continuing education. :)

For the moment, I’m happy to leave the ASCII restriction in place rather than wait for a solution. The FH UI is written in English, and its user base is English speakers, either native or as an additional language. Requiring a backup folder to have an English name is not a significant constraint in real world use.

BTW, CP have come back to me with a very detailed reply on ini files, formats, underlying objects, etc. I need to digest the details and have a play, but if there are any general points that warrant wider attention, I’ll post here.