Reputation: 107
I want to extract the values within li elements and store them into variables.
Example:
<li>Male</li><li>Hustisford, WI</li><li>United States</li>
However, it could also be like this:
<li>Hustisford, WI</li><li>United States</li>
or there may be no li elements at all.
I started with this:
author_origin = string.gsub(string.gsub(htmlcode,"<li>","#"),"</li>","|")
author_gender, author_orig_city, author_orig_country = string.match(author_origin,"#(.-)|#(.-)|#(.-)|")
This worked for the first example but not for the other cases.
I thought it should be something like this but it didn't work:
author_gender, author_orig_city, author_orig_country = string.match(author_origin,"[#]?(.-?)[|]?[#]?(.-?)[|]?[#]?(.-?)[|]?")
Upvotes: 1
Views: 1100
Reputation: 146
If you need to parse unpredictable HTML and don't mind depending on a library, you could use lua-gumbo:
local gumbo = require "gumbo"
local input = "<li>Male</li><li>Hustisford, WI</li><li>United States</li>"
local document = gumbo.parse(input)
local elements = document:getElementsByTagName("li")
local gender = elements[1].textContent
local city = elements[2].textContent
local country = elements[3].textContent
print(gender, city, country)
Upvotes: 0
Reputation: 81052
You can avoid needing multiple patterns by simply grabbing everything that matches your criteria and then figuring out what you have at the end. Something like this.
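function extract(s)
  -- Collect the text of every <li>...</li> pair, in order.
  local t = {}
  for v in s:gmatch("<li>(.-)</li>") do
    t[#t + 1] = v
  end
  -- All three fields present: gender, city, country.
  if #t == 3 then
    return (unpack or table.unpack)(t)
  end
  -- Otherwise leave the gender slot empty and return whatever was found.
  return nil, (unpack or table.unpack)(t)
end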
author_gender, author_orig_city, author_orig_country = extract("<li>Male</li><li>Hustisford, WI</li><li>United States</li>")
print(author_gender, author_orig_city, author_orig_country)
author_gender, author_orig_city, author_orig_country = extract('<li>Hustisford, WI</li><li>United States</li>')
print(author_gender, author_orig_city, author_orig_country)
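The same collect-then-decide idea can also cover the case with no li elements at all. Here is a minimal sketch, assuming the fields always appear in the order gender, city, country and that it is always the leading fields that go missing. The helper name extract_right_aligned is hypothetical, not part of the code above.

```lua
-- Hypothetical helper: collect every <li> value, then assign from the right.
-- Assumption: when fields are missing, they are always the leading ones.
local function extract_right_aligned(s)
  local t = {}
  for v in s:gmatch("<li>(.-)</li>") do
    t[#t + 1] = v
  end
  local n = #t
  -- t[n] is the country, t[n - 1] the city, t[n - 2] the gender.
  -- Indexes that fall outside the table simply yield nil.
  return t[n - 2], t[n - 1], t[n]
end

print(extract_right_aligned("<li>Male</li><li>Hustisford, WI</li><li>United States</li>"))
print(extract_right_aligned("<li>Hustisford, WI</li><li>United States</li>"))
print(extract_right_aligned(""))
```

If the input can contain more than three li elements, this sketch keeps only the last three, which may or may not be what you want.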
Upvotes: 3
Reputation: 72412
You can't do it with a single pattern. You need two. First try for three fields. If that fails, try for two fields. And you don't need to replace the HTML tags with other characters.
-- htmlcode is the original, unmodified HTML string from the question.
author_gender, author_orig_city, author_orig_country = string.match(htmlcode,"<li>(.-)</li><li>(.-)</li><li>(.-)</li>")
if author_gender==nil then
  -- No three-field match: fall back to city and country only.
  author_orig_city, author_orig_country = string.match(htmlcode,"<li>(.-)</li><li>(.-)</li>")
end
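For reuse, the two-pattern fallback can be wrapped in a function. This is only a sketch: the name parse_origin is hypothetical, and it assumes the li elements are adjacent with no whitespace between them, as in the question's examples.

```lua
-- Hypothetical wrapper around the two-pattern fallback.
-- Assumption: the <li> elements are adjacent, with nothing between them.
local function parse_origin(html)
  -- First try for all three fields.
  local gender, city, country =
    html:match("<li>(.-)</li><li>(.-)</li><li>(.-)</li>")
  if gender == nil then
    -- Fall back to two fields; gender stays nil.
    city, country = html:match("<li>(.-)</li><li>(.-)</li>")
  end
  return gender, city, country
end

print(parse_origin("<li>Male</li><li>Hustisford, WI</li><li>United States</li>"))
print(parse_origin("<li>Hustisford, WI</li><li>United States</li>"))
print(parse_origin("no list items here"))
```

When neither pattern matches, all three return values are nil, so the caller can test any of them to detect the empty case.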
Upvotes: 2