Target text without tags using Nokogiri

Question

I have some very bare HTML that I'm trying to parse using Nokogiri (on Ruby):

Address

123 Main Street

Sometown

Telephone

212-555-555


    Hours

    M-F: 8:00-21:00

       Sat-Sun: 8:00-21:00

The only tag I have is a surrounding element for the page content. Each of the things I want is preceded by a span tag holding its label, such as Address. It can be followed by another span or an hr at the end.

I'd like to end up with the address ("123 Main Street Sometown"), phone number ("212-555-555") and opening hours as separate fields.

Is there a way to get the information out using Nokogiri, or would it be easier to do this with regular expressions?

maerics · Accepted Answer

Using Nokogiri and XPath you could do something like this:

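require 'nokogiri'

# Returns a hash mapping each span's text to the text that follows it.
def extract_span_data(html)
  doc = Nokogiri::HTML(html)
  doc.xpath("//span").reduce({}) do |memo, span|
    text = ''
    node = span.next_sibling
    # Collect sibling text until the next span or the end of the parent.
    while node && (node.name != 'span')
      text += node.text
      node = node.next_sibling
    end
    memo[span.text] = text.strip
    memo
  end
end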

extract_span_data(html_string)
# {
#   "Address"   => "123 Main Street\nSometown",
#   "Telephone" => "212-555-555",
#   "Hours"     => "M-F: 8:00-21:00\n       Sat-Sun: 8:00-21:00"
# }

Using a proper parser is easier and more robust than using regular expressions (which is a well-documented bad idea™).
