Simone Luu

Reputation: 107

Lua pattern <li> html tags

I want to extract the values within li elements and store them into variables.

Example:

<li>Male</li><li>Hustisford, WI</li><li>United States</li>

However, it could also be like this:

<li>Hustisford, WI</li><li>United States</li>

or there may be no li elements at all.

I started with this:

author_origin = string.gsub(string.gsub(htmlcode,"<li>","#"),"</li>","|")

author_gender, author_orig_city, author_orig_country = string.match(author_origin,"#(.-)|#(.-)|#(.-)|") 

This worked for the first example but not for the other cases.

I thought it should be something like this but it didn't work:

author_gender, author_orig_city, author_orig_country = string.match(author_origin,"[#]?(.-?)[|]?[#]?(.-?)[|]?[#]?(.-?)[|]?")

Upvotes: 1

Views: 1100

Answers (3)

Craig Barnes

Reputation: 146

If you need to parse unpredictable HTML and don't mind depending on a library, you could use lua-gumbo:

local gumbo = require "gumbo"
local input = "<li>Male</li><li>Hustisford, WI</li><li>United States</li>"
local document = gumbo.parse(input)

local elements = document:getElementsByTagName("li")
local gender = elements[1].textContent
local city = elements[2].textContent
local country = elements[3].textContent

print(gender, city, country)

Upvotes: 0

Etan Reisner

Reputation: 81052

You can avoid needing multiple patterns by grabbing everything that matches your criteria and then working out what you have at the end. Something like this:

function extract(s)
    local t = {}
    for v in s:gmatch("<li>(.-)</li>") do
        t[#t + 1] = v
    end

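    -- All three fields present: gender, city, country.
    if #t == 3 then
        return (unpack or table.unpack)(t)
    end

    -- Otherwise treat gender as missing: return nil in its place, then whatever was found.
    return nil,(unpack or table.unpack)(t)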
end

author_gender, author_orig_city, author_orig_country = extract("<li>Male</li><li>Hustisford, WI</li><li>United States</li>")
print(author_gender, author_orig_city, author_orig_country)
author_gender, author_orig_city, author_orig_country = extract('<li>Hustisford, WI</li><li>United States</li>')
print(author_gender, author_orig_city, author_orig_country)
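If the list can also be empty or hold a single item, one possible variant is to assign the fields counting from the end of the list. This is only a sketch: the function name extract_from_end is hypothetical, and it assumes that country is always the last item and city the one before it, which the question does not confirm.

```lua
-- Hypothetical helper, not part of the answer above.
-- Assumption: when present, country is the last <li> and city the one before it.
function extract_from_end(s)
    local t = {}
    for v in s:gmatch("<li>(.-)</li>") do
        t[#t + 1] = v
    end

    local n = #t
    local country = t[n]      -- nil when there are no <li> elements
    local city = t[n - 1]     -- nil when there are fewer than two
    local gender = t[n - 2]   -- nil when there are fewer than three
    return gender, city, country
end

print(extract_from_end("<li>Male</li><li>Hustisford, WI</li><li>United States</li>"))
print(extract_from_end("<li>Hustisford, WI</li><li>United States</li>"))
print(extract_from_end(""))
```

Indexing a missing position in a Lua table yields nil, so no explicit length checks are needed here.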

Upvotes: 3

lhf

Reputation: 72412

You can't do it with a single pattern. You need two: first try for three fields, and if that fails, try for two. You also don't need to replace the HTML tags with other characters.

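-- Match against the original HTML (htmlcode), not the string with "#" and "|" substituted in.
author_gender, author_orig_city, author_orig_country = string.match(htmlcode,"<li>(.-)</li><li>(.-)</li><li>(.-)</li>")
if author_gender==nil then
   -- Fewer than three items: fall back to city and country only.
   author_orig_city, author_orig_country = string.match(htmlcode,"<li>(.-)</li><li>(.-)</li>")
end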
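For reuse, the two-pattern fallback could be wrapped in a function. A minimal sketch, assuming the input is the raw HTML string; the name parse_origin is hypothetical and not from the question:

```lua
-- Hypothetical wrapper around the two-pattern fallback described above.
function parse_origin(html)
    -- First try for three fields: gender, city, country.
    local gender, city, country =
        string.match(html, "<li>(.-)</li><li>(.-)</li><li>(.-)</li>")
    if gender == nil then
        -- Three fields not found: try for city and country only.
        city, country = string.match(html, "<li>(.-)</li><li>(.-)</li>")
    end
    -- If neither pattern matched, all three values are nil.
    return gender, city, country
end

author_gender, author_orig_city, author_orig_country =
    parse_origin("<li>Hustisford, WI</li><li>United States</li>")
print(author_gender, author_orig_city, author_orig_country)
```

When the input has no li elements both matches fail and the function returns three nils, so the caller can test any of the values before using them.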

Upvotes: 2
