
gmatch and quotes which aren't quotes

Posted: 28 Jan 2021 07:28
by shoshk
This is a bit long-winded, but I thought it might be of interest to somebody else.

First of all, I've discovered another great use for ZeroBrane. If you have a section of code which doesn't make any calls to FH built-in functions, you may be able to debug it in the ZeroBrane debugger.

In my case, I had a long string, the result of rt:GetText(), from which I wanted to extract the tables. The rich text comes from autotext for plugins.

The string looks like this:

Code:

local sText = [[
<table="2670|15690">
<row> Source |  </row>
<row> Type | Split </row>
<row> Template | Civil Registration </row>
<row> GenericType | Birth Index (gro) </row>
<row> TextFromSource | true </row>
<row> MediaFilenameFormat | StandardMediaFormat( {INDI.Name}, {EN-Fact_Name}, {EN-Suffix}, {INDI.BirthPlace:COUNTRY}, {INDI.BirthPlace:COUNTY}, {INDI.BirthDate} ) </row>
</table>

<table="2685|800|13020">
<row> Fields |  |  </row>
<row> Field | <align="c">DPU | Value </row>
<row> RP-DataProvider | <align="c">D | FindMyPast </row>
<row> TX-Fact_Name | <align="c">D | Birth </row>
<row> TX-Prefix | <align="c">D | BIRT </row>
<row> TX-Suffix | <align="c">D | index (gro) </row>
<row> TX-Database | <align="c">D | England and Wales Births 1837-2006 </row>
<row> EN-DB_Type | <align="c">D | database and images </row>
<row> TX-Citing | <align="c">D | citing General Register Office, “England and Wales Civil Registration Indexes,” London, England </row>
<row> NM-Name_Recorded_1 | <align="c">U | {Principal.Name} </row>
<row> DT-Fact_Date | <align="c">U | {Principal.BirthDate} </row>
</table>

<fs="+3">GRO Birth Index</fs>

<b><fs="+1">Principal</b></fs>
<table="2625|2415|800|12510">
<row> Tabular |  |  |  </row>
<row> Label | Token | <align="c">TLV | Expression </row>
<row> Linked To | {Principal.LinkedTo} | <align="c">L | Principal </row>
<row> Name | {Principal.Name} | <align="c">T | {First name(s)} {Last name} </row>
<row> Birth Date | {Principal.BirthDate} | <align="c">T | Q{Birth quarter} {Birth year} </row>
<row> Birth Place | {Principal.BirthPlace} | <align="c">T | {District} Registration District, {County}, {Country} </row>
</table>

<b><fs="+1">Father</b></fs>
<table="2625|2415|800|12510">
<row> Tabular |  |  |  </row>
<row> Label | Token | <align="c">TLV | Expression </row>
<row> Linked To | {Father.LinkedTo} | <align="c">L | Father </row>
<row> Name | {Father.Name} | <align="c">T | _____ {Last name} </row>
</table>

<b><fs="+1">Mother</b></fs>
<table="2625|2415|800|12510">
<row> Tabular |  |  |  </row>
<row> Label | Token | <align="c">TLV | Expression </row>
<row> Linked To | {Mother.LinkedTo} | <align="c">L | Mother </row>
<row> Name | {Mother.Name} | <align="c">T | {Mother's maiden name} </row>
</table>

<b><fs="+1">Reference</b></fs>
<table="2625|2415|800|12510">
<row> Tabular |  |  |  </row>
<row> Label | Token | <align="c">TLV | Expression </row>
<row> Volume | {Reference.Volume} | <align="c">T | {Volume} </row>
<row> Page | {Reference.Page} | <align="c">T | {Page} </row>
</table>
]]
The code to extract the tables is:

Code:

function rtUtils.ExtractTables(rt, bStripTags)
	local result = {}
	local sText = rt:GetText()
	local tIdx = 0
	local rIdx
	for t in string.gmatch(sText, '<table[%g%s]-</table>') do
		tIdx = tIdx + 1
		result[tIdx] = {}
		rIdx = 0
		for r in string.gmatch(t, '<row>[%g%s]-</row>') do
			rIdx = rIdx + 1
			result[tIdx][rIdx] = rtUtils.ExtractColumns(r, bStripTags)
		end
	end
	return result
end
My problem: the second table was not being picked up by gmatch. I fooled around quite a bit trying to see what was different about the second table until I finally noticed that the row for TX-Citing includes quotes which are not quotes. I don't remember what they're called, but they are slanted or curved (depending on the font and your editor) instead of straight up-and-down like regular quotes.

Adding the following line before starting the loop solved my problem:

Code:

	sText = sText:gsub('[“”]', '')
It may be possible to handle this with an addition to the pattern in gmatch, but I'm not the greatest at constructing patterns, so I'll stick with my solution for the time being. I'd be happy to substitute a better solution if somebody knows of one.
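
One thing to watch with the [“”] form: Lua patterns work on bytes, so in a UTF-8 script that set contains the individual bytes of both quotes (E2, 80, 9C, 9D) rather than the two characters. It does the job here, but it would also strip those bytes out of other multi-byte characters, such as an en dash (E2 80 93). A minimal sketch of a narrower alternative follows, assuming the text is UTF-8. The helper name NormalizeQuotes is hypothetical, and swapping in straight quotes instead of deleting is just one possible choice.

```lua
-- Hypothetical helper (not part of FH or the original plugin).
-- Assumes sText is UTF-8 encoded.
local function NormalizeQuotes(sText)
	-- Each smart quote is a three-byte sequence. Match the whole sequence
	-- instead of a [...] set, which would match the bytes one at a time.
	local LDQUO = '\226\128\156'  -- U+201C, bytes E2 80 9C
	local RDQUO = '\226\128\157'  -- U+201D, bytes E2 80 9D
	sText = sText:gsub(LDQUO, '"')  -- the count returned by gsub is discarded
	sText = sText:gsub(RDQUO, '"')
	return sText
end
```

Calling sText = NormalizeQuotes(sText) before the gmatch loop would take the place of the one-line gsub.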

Re: gmatch and quotes which aren't quotes

Posted: 28 Jan 2021 10:37
by tatewise
Since with [%g%s] you want to match every character, why not use dot . which is specifically intended to match all characters?
See the Patterns section of the Lua manual, http://www.lua.org/manual/5.3/manual.html#6.4.1, where the 2nd bullet says ".: (a dot) represents all characters."
e.g.
string.gmatch(sText, '<table.-</table>')
and
string.gmatch(t, '<row>.-</row>')
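
A quick standalone check of the two dot-based patterns, as a sketch only: the sample text is a cut-down version of the string from the first post, and CountTablesAndRows is a hypothetical helper used just to count what gets matched. It needs no FH functions, so it should run in any plain Lua interpreter.

```lua
-- Cut-down sample: two tables, the second containing smart quotes.
local sText = [[
<table="2670|15690">
<row> Source |  </row>
<row> Type | Split </row>
</table>

<table="2685|800|13020">
<row> TX-Citing | citing General Register Office, “England and Wales Civil Registration Indexes,” London, England </row>
<row> TX-Prefix | BIRT </row>
</table>
]]

-- Hypothetical helper: returns a list with the row count of each table found.
local function CountTablesAndRows(s)
	local counts = {}
	-- '.' matches any byte, including newlines; '-' keeps the match lazy
	-- so each match stops at the first closing tag.
	for t in s:gmatch('<table.-</table>') do
		local n = 0
		for _ in t:gmatch('<row>.-</row>') do
			n = n + 1
		end
		counts[#counts + 1] = n
	end
	return counts
end

local counts = CountTablesAndRows(sText)
print(#counts, counts[1], counts[2])
```

Both tables are found, with two rows each, even though the second contains the smart quotes.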

BTW: The “ and ” are known as smart quotes (or curly quotes). In UTF-8 each one is a three-byte sequence (E2 80 9C and E2 80 9D), and Lua patterns work byte by byte, which is probably why they don't match %g or %s but do match dot ., which simply allows each byte to have any value.
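
The byte-level behaviour can be checked with a small sketch. It assumes Lua 5.2 or later (the %g class does not exist in 5.1) and the default "C" locale, in which bytes above 0x7F are classed as neither printable nor whitespace; a different locale may give different results. MatchesWholly is a hypothetical helper name.

```lua
-- Left smart quote (U+201C) written out as its three UTF-8 bytes.
local quote = '\226\128\156'

-- Hypothetical helper: true if every byte of s matches the given class.
local function MatchesWholly(s, class)
	return s:find('^' .. class .. '+$') ~= nil
end

print(#quote)                              -- 3: the length is in bytes
print(MatchesWholly('abc def', '[%g%s]'))  -- true: plain ASCII text
print(MatchesWholly(quote, '[%g%s]'))      -- false in the C locale
print(MatchesWholly(quote, '.'))           -- true: dot accepts any byte
```

This would explain why '<table[%g%s]-</table>' gives up part-way through the second table: the lazy match cannot step over the first byte of the smart quote, so it never reaches the closing tag.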

Re: gmatch and quotes which aren't quotes

Posted: 28 Jan 2021 10:40
by shoshk
I tried . at one point but it didn't work. The pattern may not have been exactly the same. I'll give your suggestion a try.