Reputation: 25328
I've got the following HTML:
<table width="100%" border="0" cellpadding="6" cellspacing="1">
<tbody>
<tr>
<td bgcolor="#ffd204" width="40%" nowrap=""><b>Tracking Number:</b></td>
<td bgcolor="#ffffff" width="60%" nowrap="">C123456789012345</td>
</tr>
<!-- ...there could be additional table rows here... -->
<tr>
<td bgcolor="#ffd204" width="40%" nowrap=""><b>Deliver To:</b></td>
<td bgcolor="#ffffff" width="60%" nowrap="">ANYWHERE, NY</td>
</tr>
</tbody>
</table>
Say, for instance, I need to pull the ANYWHERE, NY data. How would I do that using Nokogiri? Or is there something better for traversing this sort of thing where there aren't any CSS selectors to search with?
Upvotes: 2
Views: 801
Reputation: 303168
Since we don't have a CSS class, id attribute, or other semantic markup to use, we instead anchor the search to something in the document that is unlikely to change. In this case, I suspect that the "Deliver To:" label will always come right before the td we want. So:
require 'nokogiri'
# Load the page: read it from a file via IO.read(), or fetch it over HTTP via open-uri.
html = IO.read('page.html') # 'page.html' is a placeholder path
doc = Nokogiri.HTML(html)
delivery = doc.at_xpath '//td[preceding-sibling::td[b="Deliver To:"]]/text()'
p delivery.content
#=> "ANYWHERE, NY"
That XPath expression says:
// - at any level,
td - find me an element named td
[…] - but only if…
preceding-sibling:: - it has a preceding sibling
td - that is an element named td
[…] - but only if…
b - it has a child element named b
="Deliver To:" - whose text content equals this string
/text() - and then find me the child text node(s) of that td.
Because we used at_xpath instead of xpath, Nokogiri returns only the first matching node (which in this case happens to be the only child text node of that td) rather than a NodeSet of every match.
If that <td> can contain markup, such as <td…>ANYWHERE,<br>NY</td>, you can drop the trailing /text() from the expression (so that it selects the <td> itself) and then call the text method to get the combined text of everything inside it.
Upvotes: 8
Reputation: 25757
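If you don't mind some preprocessing, you could build a hash of every label/value pair first:
require 'nokogiri'
require 'open-uri'
lookup = {}
# URI.open comes from open-uri; a bare open() no longer accepts URLs in Ruby 3.0+
c = Nokogiri::HTML(URI.open("http://..."))
c.search("tr").each do |tr|
  cells = tr.search("td")
  lookup[cells.first.text.gsub(':', '')] = cells.last.text
end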
puts lookup["Tracking Number"]
I didn't test that code so there might be some syntax issues.
Upvotes: 0