Reputation: 113747
I have some very bare HTML that I'm trying to parse using Nokogiri (on Ruby):
<span>Address</span><br />
123 Main Street<br />
Sometown<br />
<span>Telephone</span><br />
<a href="tel:212-555-555">212-555-555</a><br />
<span>Hours</span><br />
M-F: 8:00-21:00<br />
Sat-Sun: 8:00-21:00<br />
<hr />
The only tag I have is a surrounding <div> for the page content. Each of the things I want is preceded by a <span>Address</span>-type tag. It can be followed by another span, or by an hr at the end.
I'd like to end up with the address ("123 Main Street\nSometown"), phone number ("212-555-555") and opening hours as separate fields.
Is there a way to get the information out using Nokogiri, or would it be easier to do this with regular expressions?
Upvotes: 3
Views: 1757
Reputation: 4614
I was thinking about (or rather, learning) XPath. Here d is the surrounding <div> node:
d.xpath("span[2]/preceding-sibling::text()").each {|i| puts i}
# 123 Main Street
# Sometown
d.xpath("a/text()").text
# "212-555-555"
d.xpath("span[3]/following::text()").text.strip
# "M-F: 8:00-21:00 Sat-Sun: 8:00-21:00"
The first expression starts at the second span and selects the text() nodes that precede it.
You can also try another approach: start at the first span, select the text() nodes that follow it, and finish with a predicate that checks for a following span.
d.xpath("span[1]/following::text()[following-sibling::span]").each {|i| puts i}
# 123 Main Street
# Sometown
If the document has more spans, you can select the right ones by their text instead of by position: span[x] could be replaced by span[contains(.,'text-in-span')], so here span[3] == span[contains(.,'Hours')].
Correct me if anything here is wrong.
Upvotes: 0
Reputation: 156384
Using Nokogiri and XPath you could do something like this:
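require 'nokogiri'

# Map each <span> label to the text that follows it, up to the next <span>.
def extract_span_data(html)
  doc = Nokogiri::HTML(html)
  doc.xpath("//span").reduce({}) do |memo, span|
    text = ''
    node = span.next_sibling
    # Collect sibling text until the next <span> or the end of the parent.
    while node && (node.name != 'span')
      text += node.text
      node = node.next_sibling
    end
    memo[span.text] = text.strip
    memo
  end
end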
extract_span_data(html_string)
# {
# "Address" => "123 Main Street\nSometown",
# "Telephone" => "212-555-555",
# "Hours" => "M-F: 8:00-21:00\n Sat-Sun: 8:00-21:00"
# }
Using a proper parser is easier and more robust than using regular expressions (parsing HTML with regexes is a well-documented bad idea™).
Upvotes: 5