Shpigford

Reputation: 25328

Traverse HTML with no CSS class using Nokogiri?

I've got the following HTML:

<table width="100%" border="0" cellpadding="6" cellspacing="1">
  <tbody>
    <tr>
      <td bgcolor="#ffd204" width="40%" nowrap=""><b>Tracking Number:</b></td>
      <td bgcolor="#ffffff" width="60%" nowrap="">C123456789012345</td>
    </tr>
    <!-- ...there could be additional table rows here... -->
    <tr>
      <td bgcolor="#ffd204" width="40%" nowrap=""><b>Deliver To:</b></td>
      <td bgcolor="#ffffff" width="60%" nowrap="">ANYWHERE, NY</td>
    </tr>
  </tbody>
</table>

Say, for instance, I need to pull the ANYWHERE, NY data. How would I do that using Nokogiri? Or is there something better for traversing this sort of markup, where there aren't any CSS classes or ids to select on?

Upvotes: 2

Views: 801

Answers (2)

Phrogz

Reputation: 303168

Since we don't have a CSS class, id attribute, or other semantic markup to use, we instead anchor our search to something in the document that is unlikely to change. In this case, I suspect that the "Deliver To:" label will always come right before the td we want. So:

require 'nokogiri'

# 'page.html' is a placeholder filename; alternatively fetch over HTTP with open-uri
html = IO.read('page.html')
doc = Nokogiri.HTML(html)
delivery = doc.at_xpath '//td[preceding-sibling::td[b="Deliver To:"]]/text()'
p delivery.content
#=> "ANYWHERE, NY"

That XPath expression says:

  • // — at any level,
  • td — find me an element named td
  • […] — but only if…
    • preceding-sibling:: — it has a preceding sibling
    • td — that is an element named td
    • […] — but only if…
      • b — it has a child element named b
      • ="Deliver To:" — whose text content equals this string
  • /text() — and then find me the child text node(s) of that td.

Because we used at_xpath instead of xpath, Nokogiri returns the first matching node it finds (which in this case happens to be the only child text node of that td) rather than a NodeSet of all matching nodes.

If that <td> can contain markup, such as <td…>ANYWHERE,<br>NY</td>, you can omit the trailing /text() from the expression (so that you select the <td> itself) and then call the text method to fetch the combined text inside it.

Upvotes: 8

moritz

Reputation: 25757

If you don't mind some preprocessing, you could build a lookup table from every row:

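require 'nokogiri'
require 'open-uri'

lookup = {}
# URI.open comes from open-uri; a bare open() no longer fetches URLs on Ruby 3.0+
c = Nokogiri::HTML(URI.open("http://..."))
c.search("tr").each do |tr|
  cells = tr.search("td")
  next if cells.empty? # skip rows that contain no <td> cells
  # Key on the label with its colon removed, e.g. "Tracking Number"
  lookup[cells.first.text.gsub(':', '')] = cells.last.text
end

puts lookup["Tracking Number"]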

I didn't test that code so there might be some syntax issues.

Upvotes: 0
