Encode XML/HTML/URI/UTF Characters (code snippet)

Description

When generating XML/HTML for web pages, it is good practice, and often essential, to encode certain characters as XML/HTML escape sequences and UTF-8 multi-byte codes. The URLs in anchor tag hyperlinks should additionally use URI encodings.

Plugins use Code Page 1252 characters. These are encoded by the StrEncode() function, which uses the string library function gsub() to replace each selected character with its encoding.
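
As a minimal sketch of this mechanism, the following shows gsub() replacing each matched character with its lookup table entry. The names TblMini and StrEncodeDemo are hypothetical, and the table holds only three entries for illustration; the real TblCP1252 is much larger.

```lua
-- Hypothetical three-entry lookup table, for illustration only
TblMini = { ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;" }

-- Same shape as StrEncode(), but the table is passed as a parameter
function StrEncodeDemo(strText,strPattern,tblCodes)
	-- gsub() returns the new string and a match count; the assignment keeps only the string
	strText = (strText or ""):gsub(strPattern,tblCodes)
	return strText
end -- function StrEncodeDemo

strResult = StrEncodeDemo("a < b & c","[&<>]",TblMini)
print(strResult)	--> a &lt; b &amp; c
```

The `(strText or "")` guard means a nil argument yields an empty string rather than an error.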

The strPattern parameter defines a Lua pattern character class that selects the characters to be encoded. Examples are "[€-ÿ]", which selects every Code Page 1252 character from € (0x80) to ÿ (0xFF), i.e. those that need multi-byte UTF-8 codes, and "[%c\"&'<>€-ÿ]", which selects all control characters plus " & ' < > and the same € to ÿ range for XML/HTML encoding.
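
To illustrate how such a character class selects characters, the sketch below counts the matches in a sample string. It writes the range with decimal byte escapes ("\128" for € and "\255" for ÿ in Code Page 1252) so that it does not depend on how the script file is saved. The variable names are illustrative only.

```lua
-- Character classes written with byte escapes instead of literal characters
strClassUTF = "[\128-\255]"		-- bytes 0x80 to 0xFF
strClassXML = "[%c\"&'<>\128-\255]"	-- control chars, " & ' < > and bytes 0x80 to 0xFF

-- Sample text: A & B, a newline, the CP1252 Euro byte, then Z
strSample = "A&B\n"..string.char(0x80).."Z"

-- The second result of gsub() is the number of matches
intUTF = select(2,strSample:gsub(strClassUTF,""))
intXML = select(2,strSample:gsub(strClassXML,""))
print(intUTF,intXML)	--> 1	3
```

The first class matches only the Euro byte, while the second also matches the ampersand and the newline.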

The encodings are defined by the TblCP1252 lookup table, indexed by Code Page 1252 characters from "\000" to "\255", i.e. string.char(0x00) to string.char(0xFF). Entries in this table are only applied if selected by the strPattern. If a selected character has no entry in the table, it is retained untranslated.
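
The two rules about table entries can be checked with a small hypothetical table (TblRules is not part of the snippet). The sketch also shows that a literal, a decimal escape and string.char() all index the same single-byte key.

```lua
-- Hypothetical table: "&" and " " have entries, "<" has none
TblRules = { ["&"] = "&amp;", [" "] = "+" }

-- " " has an entry but is not selected by the pattern, so it is untouched.
-- "<" is selected but has no entry, so gsub() keeps the original character.
strResult = ("a & b < c"):gsub("[&<]",TblRules)
print(strResult)	--> a &amp; b < c

-- Keys are single-byte strings, so these three forms address one entry
TblRules[string.char(0x26)] = "%26"	-- 0x26 is "&"
print(TblRules["&"],TblRules["\038"])	--> %26	%26
```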

This technique is both fast and flexible. It can easily be adapted to perform many forms of character encoding. Examples are given for encoding just UTF-8 characters, and for XML/HTML/UTF-8, and for URI/URL.
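
As one illustration of such an adaptation, the sketch below builds a different lookup table that turns the characters "\160" to "\255" into numeric character references such as &#233; for pages that must stay pure ASCII. It relies on the fact that these Code Page 1252 codes coincide with Unicode code points U+00A0 to U+00FF; the codes "\128" to "\159" do not, so they are deliberately left out. TblNumRef and StrCP1252_NumRef are hypothetical names, not part of the snippet.

```lua
-- Hypothetical table: CP1252 bytes 0xA0 to 0xFF as numeric character references
TblNumRef = { }
for intCode = 0xA0, 0xFF do
	TblNumRef[string.char(intCode)] = "&#"..intCode..";"
end

function StrCP1252_NumRef(strText)
	-- Only bytes 160 to 255 are selected; everything else passes through
	strText = (strText or ""):gsub("[\160-\255]",TblNumRef)
	return strText
end -- function StrCP1252_NumRef

-- "\233" is e-acute in CP1252
print(StrCP1252_NumRef("caf\233"))	--> caf&#233;
```

Only the table and the pattern change; the gsub() call has the same form as in StrEncode().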

For XML/HTML/UTF-8 some layout control characters are translated to a line break <br> tag. If such tags are encoded again, they become &lt;br&gt;, which must be translated back to <br>.
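
The effect of encoding twice can be seen in the following sketch, which uses a hypothetical cut-down table (TblXmlDemo) and function (StrXmlDemo) in place of TblCP1252 and StrCP1252_XML.

```lua
-- Hypothetical cut-down XML/HTML table
TblXmlDemo = { ["\n"] = "<br>", ["&"] = "&amp;", ["<"] = "&lt;", [">"] = "&gt;" }

function StrXmlDemo(strText)
	-- Encode, then restore any <br> tag that was itself encoded
	strText = (strText or ""):gsub("[%c&<>]",TblXmlDemo):gsub("&lt;br&gt;","<br>")
	return strText
end -- function StrXmlDemo

strOnce  = StrXmlDemo("A\nB")	-- newline becomes a <br> tag
strTwice = StrXmlDemo(strOnce)	-- tag is encoded, then restored
print(strOnce,strTwice)	--> A<br>B	A<br>B
```

Only the <br> tag is restored. In this sketch any other entity is encoded again on a second pass, so &amp; would become &amp;amp;.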

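For URI/URL the XML/HTML entries for tab and " & ' < > are temporarily replaced with their URI encodings and restored afterwards, and each <br> tag produced from a layout control character is replaced with the %0A newline encoding.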
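
An alternative to temporarily changing shared entries is to keep a separate table for URI encoding, so that nothing has to be restored afterwards. This is only a sketch of that design choice, not the snippet's method; TblUriDemo and StrUriDemo are hypothetical and cover just a few characters.

```lua
-- Hypothetical cut-down URI table, kept separate from the XML/HTML one
TblUriDemo = {
	["\n"] = "%0A", ["\t"] = "+", [" "] = "+",
	['"'] = "%22", ["&"] = "%26", ["<"] = "%3C", [">"] = "%3E",
	["="] = "%3D", ["?"] = "%3F",
}

function StrUriDemo(strText)
	-- Table values are inserted literally, so "%" needs no escaping here
	strText = (strText or ""):gsub("[%c%s%p]",TblUriDemo)
	return strText
end -- function StrUriDemo

print(StrUriDemo("a b&c=<d>\n"))	--> a+b%26c%3D%3Cd%3E%0A
```

With a dedicated table the newline can map straight to %0A, so no intermediate <br> tag is needed.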

This needs updating to cope with Unicode characters and the UTF-8 encoded Plugins introduced by FH V6.
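
One possible direction for that update, offered only as a sketch, is to percent-encode byte by byte. Because the result depends only on byte values, the same function handles text that is already UTF-8. StrPercentDemo is a hypothetical name, and the set of characters left unencoded (letters, digits and - . _ ~) is the unreserved set from RFC 3986 rather than the set used by the snippet.

```lua
-- Hypothetical byte-wise percent-encoder for text that is already UTF-8
function StrPercentDemo(strText)
	-- Every byte outside the unreserved set becomes "%" plus two hex digits
	strText = (strText or ""):gsub("[^A-Za-z0-9%-%._~]",function(strChar)
		return string.format("%%%02X",strChar:byte())
	end)
	return strText
end -- function StrPercentDemo

-- "\195\169" is the two-byte UTF-8 encoding of e-acute
print(StrPercentDemo("caf\195\169 au lait"))	--> caf%C3%A9%20au%20lait
```

Unlike StrCP1252_URI, this sketch encodes a space as %20 rather than +, and it does not convert Code Page 1252 text to UTF-8 first.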

Requires: None

Code

EncodeChars.fh_lua
--[[
@Title:		Encode XML/HTML/URI/UTF Characters
@Author:	Mike Tate
@LastUpdated:	June 2013
@Version:	2.0
@Description:	Encode characters into XML/HTML/URI/UTF codes.
]]
 
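-- Code Page 1252 to XML/HTML/URI/UTF8 encodings defined below
-- NB: Save this file as Code Page 1252 (ANSI) so that literals such as "€" and "ÿ" are single bytes
	TblCP1252 = { }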
 
-- Encode characters according to gsub pattern & lookup table --
function StrEncode(strText,strPattern)
	strText = (strText or ""):gsub(strPattern,TblCP1252)
	return strText
end -- function StrEncode
 
-- Encode CP1252 characters into UTF8 codes --
function StrCP1252_UTF(strText)
	strText = StrEncode(strText,"[€-ÿ]")
	return strText
end -- function StrCP1252_UTF
 
-- Encode CP1252 characters into XML/HTML/UTF8 codes --
function StrCP1252_XML(strText)
	strText = StrEncode(strText,"[%c\"&'<>€-ÿ]"):gsub("&lt;br&gt;","<br>")
	return strText
end -- function StrCP1252_XML
 
-- Encode CP1252 characters into URI codes --
function StrCP1252_URI(strText)
	TblCP1252["\t"] = "+"	-- Temporarily use URI coding in place of XML/HTML coding
	TblCP1252['"'] = "%22"
	TblCP1252["&"] = "%26"
	TblCP1252["'"] = "%27"
	TblCP1252["<"] = "%3C"
	TblCP1252[">"] = "%3E"
	strText = StrEncode(strText,"[%c%s%p€-ÿ]"):gsub("<br>","%%0A")
	TblCP1252["\t"] = nil
	TblCP1252['"'] = "&quot;"
	TblCP1252["&"] = "&amp;"
	TblCP1252["'"] = "&apos;"
	TblCP1252["<"] = "&lt;"
	TblCP1252[">"] = "&gt;"
	return strText
end -- function StrCP1252_URI
 
-- Code Page 1252 to XML/HTML/URI/UTF8 encodings: http://en.wikipedia.org/wiki/Windows-1252
 
-- Control characters "\000" to "\031" or "[%c]" are mostly disallowed, or replaced with "<br>"
	TblCP1252["\000"] = ""		-- NUL
	TblCP1252["\001"] = ""		-- SOH
	TblCP1252["\002"] = ""		-- STX
	TblCP1252["\003"] = ""		-- ETX
	TblCP1252["\004"] = ""		-- EOT
	TblCP1252["\005"] = ""		-- ENQ
	TblCP1252["\006"] = ""		-- ACK
	TblCP1252["\a"] = ""		-- BEL
	TblCP1252["\b"] = ""		-- BS
	TblCP1252["\t"] = nil		-- HT	"\t" treated as space in XML/HTML
	TblCP1252["\n"] = "<br>"	-- LF	"\n" treated as space in XML/HTML	"%0A" allowed in URI
	TblCP1252["\v"] = "<br>"	-- VT	<br> is newline break in XML/HTML
	TblCP1252["\f"] = "<br>"	-- FF	<br> is newline break in XML/HTML
	TblCP1252["\r"] = "<br>"	-- CR	"\r" treated as space in XML/HTML	"%0D" allowed in URI
	TblCP1252["\014"] = ""		-- SO
	TblCP1252["\015"] = ""		-- SI
	TblCP1252["\016"] = ""		-- DLE
	TblCP1252["\017"] = ""		-- DC1
	TblCP1252["\018"] = ""		-- DC2
	TblCP1252["\019"] = ""		-- DC3
	TblCP1252["\020"] = ""		-- DC4
	TblCP1252["\021"] = ""		-- NAK
	TblCP1252["\022"] = ""		-- SYN
	TblCP1252["\023"] = ""		-- ETB
	TblCP1252["\024"] = ""		-- CAN
	TblCP1252["\025"] = ""		-- EM
	TblCP1252["\026"] = ""		-- SUB
	TblCP1252["\027"] = ""		-- ESC
	TblCP1252["\028"] = ""		-- FS
	TblCP1252["\029"] = ""		-- GS
	TblCP1252["\030"] = ""		-- RS
	TblCP1252["\031"] = ""		-- US
-- URI "[%s%p]" encodings: http://en.wikipedia.org/wiki/URL and http://en.wikipedia.org/wiki/Percent-encoding
	TblCP1252[" "] = "+"		-- "%20"	Space
	TblCP1252["!"] = "%21"
	TblCP1252["#"] = "%23"
	TblCP1252["$"] = "%24"
	TblCP1252["%"] = "%25"
	TblCP1252["&"] = "%26"		-- "&" and "'" are overridden by XML/HTML encodings below
	TblCP1252["'"] = "%27"
	TblCP1252["("] = "%28"
	TblCP1252[")"] = "%29"
	TblCP1252["*"] = "%2A"
	TblCP1252["+"] = "%2B"
	TblCP1252[","] = "%2C"
	TblCP1252["/"] = "%2F"
	TblCP1252[":"] = "%3A"
	TblCP1252[";"] = "%3B"
	TblCP1252["="] = "%3D"
	TblCP1252["?"] = "%3F"
	TblCP1252["@"] = "%40"
	TblCP1252["["] = "%5B"
	TblCP1252["]"] = "%5D"
-- XML/HTML "[\"&'<>]" encodings: http://en.wikipedia.org/wiki/XML and http://en.wikipedia.org/wiki/HTML
	TblCP1252['"'] = "&quot;"
	TblCP1252["&"] = "&amp;"
	TblCP1252["'"] = "&apos;"
	TblCP1252["<"] = "&lt;"
	TblCP1252[">"] = "&gt;"
	TblCP1252["\127"] = ""		-- DEL
-- UTF-8 "[€-ÿ]" encodings: http://en.wikipedia.org/wiki/UTF-8
	-- Take the Unicode code point of each CP1252 character and encode it using the UTF-8 scheme
	TblCP1252["€"] = string.char(0xE2,0x82,0xAC)	-- "&euro;"
	TblCP1252["\129"] = ""		-- Undefined
	TblCP1252["‚"] = string.char(0xE2,0x80,0x9A)
	TblCP1252["ƒ"] = string.char(0xC6,0x92)
	TblCP1252["„"] = string.char(0xE2,0x80,0x9E)
	TblCP1252["…"] = string.char(0xE2,0x80,0xA6)
	TblCP1252["†"] = string.char(0xE2,0x80,0xA0)
	TblCP1252["‡"] = string.char(0xE2,0x80,0xA1)
	TblCP1252["ˆ"] = string.char(0xCB,0x86)
	TblCP1252["‰"] = string.char(0xE2,0x80,0xB0)
	TblCP1252["Š"] = string.char(0xC5,0xA0)
	TblCP1252["‹"] = string.char(0xE2,0x80,0xB9)
	TblCP1252["Œ"] = string.char(0xC5,0x92)
	TblCP1252["\141"] = ""		-- Undefined
	TblCP1252["Ž"] = string.char(0xC5,0xBD)
	TblCP1252["\143"] = ""		-- Undefined
	TblCP1252["\144"] = ""		-- Undefined
	TblCP1252["‘"] = string.char(0xE2,0x80,0x98)
	TblCP1252["’"] = string.char(0xE2,0x80,0x99)
	TblCP1252["“"] = string.char(0xE2,0x80,0x9C)
	TblCP1252["”"] = string.char(0xE2,0x80,0x9D)
	TblCP1252["•"] = string.char(0xE2,0x80,0xA2)
	TblCP1252["–"] = string.char(0xE2,0x80,0x93)
	TblCP1252["—"] = string.char(0xE2,0x80,0x94)
	TblCP1252["\152"] = string.char(0xCB,0x9C)	-- Small Tilde
	TblCP1252["™"] = string.char(0xE2,0x84,0xA2)
	TblCP1252["š"] = string.char(0xC5,0xA1)
	TblCP1252["›"] = string.char(0xE2,0x80,0xBA)
	TblCP1252["œ"] = string.char(0xC5,0x93)
	TblCP1252["\157"] = ""		-- Undefined
	TblCP1252["ž"] = string.char(0xC5,0xBE)
	TblCP1252["Ÿ"] = string.char(0xC5,0xB8)
	TblCP1252["\160"] = string.char(0xC2,0xA0)	-- "&nbsp;"	No Break Space
	TblCP1252["¡"] = string.char(0xC2,0xA1)		-- "&iexcl;"
	TblCP1252["¢"] = string.char(0xC2,0xA2)		-- "&cent;"
	TblCP1252["£"] = string.char(0xC2,0xA3)		-- "&pound;"
	TblCP1252["¤"] = string.char(0xC2,0xA4)		-- "&curren;"
	TblCP1252["¥"] = string.char(0xC2,0xA5)		-- "&yen;"
	TblCP1252["¦"] = string.char(0xC2,0xA6)
	TblCP1252["§"] = string.char(0xC2,0xA7)
	TblCP1252["¨"] = string.char(0xC2,0xA8)
	TblCP1252["©"] = string.char(0xC2,0xA9)
	TblCP1252["ª"] = string.char(0xC2,0xAA)
	TblCP1252["«"] = string.char(0xC2,0xAB)
	TblCP1252["¬"] = string.char(0xC2,0xAC)
	TblCP1252["­"] = string.char(0xC2,0xAD)		-- "&shy;"	Soft Hyphen
	TblCP1252["®"] = string.char(0xC2,0xAE)
	TblCP1252["¯"] = string.char(0xC2,0xAF)
	TblCP1252["°"] = string.char(0xC2,0xB0)
	TblCP1252["±"] = string.char(0xC2,0xB1)
	TblCP1252["²"] = string.char(0xC2,0xB2)
	TblCP1252["³"] = string.char(0xC2,0xB3)
	TblCP1252["´"] = string.char(0xC2,0xB4)
	TblCP1252["µ"] = string.char(0xC2,0xB5)
	TblCP1252["¶"] = string.char(0xC2,0xB6)
	TblCP1252["·"] = string.char(0xC2,0xB7)		-- "&middot;"	Middle Dot
	TblCP1252["¸"] = string.char(0xC2,0xB8)
	TblCP1252["¹"] = string.char(0xC2,0xB9)
	TblCP1252["º"] = string.char(0xC2,0xBA)
	TblCP1252["»"] = string.char(0xC2,0xBB)
	TblCP1252["¼"] = string.char(0xC2,0xBC)
	TblCP1252["½"] = string.char(0xC2,0xBD)
	TblCP1252["¾"] = string.char(0xC2,0xBE)
	TblCP1252["¿"] = string.char(0xC2,0xBF)
	TblCP1252["À"] = string.char(0xC3,0x80)
	TblCP1252["Á"] = string.char(0xC3,0x81)
	TblCP1252["Â"] = string.char(0xC3,0x82)
	TblCP1252["Ã"] = string.char(0xC3,0x83)
	TblCP1252["Ä"] = string.char(0xC3,0x84)
	TblCP1252["Å"] = string.char(0xC3,0x85)
	TblCP1252["Æ"] = string.char(0xC3,0x86)
	TblCP1252["Ç"] = string.char(0xC3,0x87)
	TblCP1252["È"] = string.char(0xC3,0x88)
	TblCP1252["É"] = string.char(0xC3,0x89)
	TblCP1252["Ê"] = string.char(0xC3,0x8A)
	TblCP1252["Ë"] = string.char(0xC3,0x8B)
	TblCP1252["Ì"] = string.char(0xC3,0x8C)
	TblCP1252["Í"] = string.char(0xC3,0x8D)
	TblCP1252["Î"] = string.char(0xC3,0x8E)
	TblCP1252["Ï"] = string.char(0xC3,0x8F)
	TblCP1252["Ð"] = string.char(0xC3,0x90)
	TblCP1252["Ñ"] = string.char(0xC3,0x91)
	TblCP1252["Ò"] = string.char(0xC3,0x92)
	TblCP1252["Ó"] = string.char(0xC3,0x93)
	TblCP1252["Ô"] = string.char(0xC3,0x94)
	TblCP1252["Õ"] = string.char(0xC3,0x95)
	TblCP1252["Ö"] = string.char(0xC3,0x96)
	TblCP1252["×"] = string.char(0xC3,0x97)
	TblCP1252["Ø"] = string.char(0xC3,0x98)
	TblCP1252["Ù"] = string.char(0xC3,0x99)
	TblCP1252["Ú"] = string.char(0xC3,0x9A)
	TblCP1252["Û"] = string.char(0xC3,0x9B)
	TblCP1252["Ü"] = string.char(0xC3,0x9C)
	TblCP1252["Ý"] = string.char(0xC3,0x9D)
	TblCP1252["Þ"] = string.char(0xC3,0x9E)
	TblCP1252["ß"] = string.char(0xC3,0x9F)
	TblCP1252["à"] = string.char(0xC3,0xA0)
	TblCP1252["á"] = string.char(0xC3,0xA1)
	TblCP1252["â"] = string.char(0xC3,0xA2)
	TblCP1252["ã"] = string.char(0xC3,0xA3)
	TblCP1252["ä"] = string.char(0xC3,0xA4)
	TblCP1252["å"] = string.char(0xC3,0xA5)
	TblCP1252["æ"] = string.char(0xC3,0xA6)
	TblCP1252["ç"] = string.char(0xC3,0xA7)
	TblCP1252["è"] = string.char(0xC3,0xA8)
	TblCP1252["é"] = string.char(0xC3,0xA9)
	TblCP1252["ê"] = string.char(0xC3,0xAA)
	TblCP1252["ë"] = string.char(0xC3,0xAB)
	TblCP1252["ì"] = string.char(0xC3,0xAC)
	TblCP1252["í"] = string.char(0xC3,0xAD)
	TblCP1252["î"] = string.char(0xC3,0xAE)
	TblCP1252["ï"] = string.char(0xC3,0xAF)
	TblCP1252["ð"] = string.char(0xC3,0xB0)
	TblCP1252["ñ"] = string.char(0xC3,0xB1)
	TblCP1252["ò"] = string.char(0xC3,0xB2)
	TblCP1252["ó"] = string.char(0xC3,0xB3)
	TblCP1252["ô"] = string.char(0xC3,0xB4)
	TblCP1252["õ"] = string.char(0xC3,0xB5)
	TblCP1252["ö"] = string.char(0xC3,0xB6)
	TblCP1252["÷"] = string.char(0xC3,0xB7)
	TblCP1252["ø"] = string.char(0xC3,0xB8)
	TblCP1252["ù"] = string.char(0xC3,0xB9)
	TblCP1252["ú"] = string.char(0xC3,0xBA)
	TblCP1252["û"] = string.char(0xC3,0xBB)
	TblCP1252["ü"] = string.char(0xC3,0xBC)
	TblCP1252["ý"] = string.char(0xC3,0xBD)
	TblCP1252["þ"] = string.char(0xC3,0xBE)
	TblCP1252["ÿ"] = string.char(0xC3,0xBF)

Usage

This shows how various CP1252 characters are encoded into XML/HTML/URI/UTF8 compatible codes.

strText = "\n\t\r !\"#$%&'()*+,-./0123456789:;<=>?@AZ[\\]^_`az{|}~ Euro=€ Ellipsis=… Last=ÿ"
print(StrCP1252_UTF(strText).."\n")
print(StrCP1252_XML(strText).."\n")
print(StrCP1252_URI(strText).."\n")

produces:

	 !"#$%&'()*+,-./0123456789:;<=>?@AZ[\]^_`az{|}~ Euro=€ Ellipsis=… Last=ÿ

<br>	<br> !&quot;#$%&amp;&apos;()*+,-./0123456789:;&lt;=&gt;?@AZ[\]^_`az{|}~ Euro=€ Ellipsis=… Last=ÿ

%0A+%0A+%21%22%23%24%25%26%27%28%29%2A%2B%2C-.%2F0123456789%3A%3B%3C%3D%3E%3F%40AZ%5B\%5D^_`az{|}~+Euro%3D€+Ellipsis%3D…+Last%3Dÿ