nevan king

Reputation: 113747

Target text without tags using Nokogiri

I have some very bare HTML that I'm trying to parse using Nokogiri (on Ruby):

<span>Address</span><br />
123 Main Street<br />
Sometown<br />
<span>Telephone</span><br />
<a href="tel:212-555-555">212-555-555</a><br />

    <span>Hours</span><br />
    M-F: 8:00-21:00<br />
       Sat-Sun: 8:00-21:00<br />
<hr />

The only wrapper I have is a surrounding <div> for the page content. Each field I want is preceded by a label tag such as <span>Address</span>, and it ends either at the next span or at the <hr /> at the end.

I'd like to end up with the address ("123 Main Street\nSometown"), phone number ("212-555-555") and opening hours as separate fields.

Is there a way to get the information out using Nokogiri, or would it be easier to do this with regular expressions?

Upvotes: 3

Views: 1757

Answers (2)

A.D.

Reputation: 4614

I was thinking about (or rather, learning) XPath for this. With d as the surrounding <div> node:

d.xpath("span[2]/preceding-sibling::text()").each {|i| puts i}
# 123 Main Street
# Sometown

d.xpath("a/text()").text
# "212-555-555"

d.xpath("span[3]/following::text()").text.strip
# "M-F: 8:00-21:00       Sat-Sun: 8:00-21:00"

The first expression starts at the second span and selects the text() nodes that precede it.
You can also try another approach: start at the first span, select the text() nodes that follow it, and add a predicate that keeps only those with a later sibling span.

d.xpath("span[1]/following::text()[following-sibling::span]").each {|i| puts i}
# 123 Main Street
# Sometown

If the document has more spans, you can select the right one by its content instead of its position:
span[x] can be replaced by span[contains(.,'text-in-span')]
For example, span[3] == span[contains(.,'Hours')]

Correct me if something here is wrong.

Upvotes: 0

maerics

Reputation: 156384

Using Nokogiri and XPath you could do something like this:

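require 'nokogiri'

# Maps each <span> label to the text that follows it, up to the next <span>.
def extract_span_data(html)
  doc = Nokogiri::HTML(html)
  doc.xpath("//span").each_with_object({}) do |span, memo|
    text = ''
    node = span.next_sibling
    # Walk the following siblings until the next label or the end of the parent.
    while node && (node.name != 'span')
      text += node.text
      node = node.next_sibling
    end
    memo[span.text] = text.strip
  end
end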

extract_span_data(html_string)
# {
#   "Address"   => "123 Main Street\nSometown",
#   "Telephone" => "212-555-555",
#   "Hours"     => "M-F: 8:00-21:00\n       Sat-Sun: 8:00-21:00"
# }

Using a proper parser is easier and more robust than using regular expressions (parsing HTML with regular expressions is a well-documented bad idea™).

Upvotes: 5
