Traverse HTML with no CSS class using Nokogiri?

Question

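I've got the following HTML:

<table>
  <tbody>
    <tr>
      <td><b>Tracking Number:</b></td>
      <td>C123456789012345</td>
    </tr>
    <tr>
      <td><b>Deliver To:</b></td>
      <td>ANYWHERE, NY</td>
    </tr>
  </tbody>
</table>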

Say, for instance, I need to pull the ANYWHERE, NY data. How would I do that using Nokogiri? Or is there something better for traversing this sort of markup, where there aren't any CSS classes or ids to search with?

Phrogz · Accepted Answer

Since we don't have a CSS class, id attribute, or other semantic markup to use, we instead look for something in this document that is unlikely to change and anchor our search to it. In this case, I suspect that the "Deliver To:" label will always come right before the td we want. So:

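require 'nokogiri'

# Load the page: read a saved copy with File.read, or fetch it over HTTP with open-uri's URI.open.
html = File.read('tracking.html') # hypothetical file name
doc = Nokogiri.HTML(html)
# Anchor on the "Deliver To:" label, then take the text inside the td that follows it.
delivery = doc.at_xpath '//td[preceding-sibling::td[b="Deliver To:"]]/text()'
p delivery.content
#=> "ANYWHERE, NY"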

That XPath expression says:

// — at any level,
td — find me an element named td
[…] — but only if…
- preceding-sibling:: — it has a preceding sibling
- td — that is an element named td
- […] — but only if…
  - b — it has a child element named b
  - ="Deliver To:" — whose text content equals this string
/text() — and then find me the child text node(s) of that td.

Because we used at_xpath instead of xpath, Nokogiri returns the first matching node it can find (which in this case happens to be the only child text node of that td) instead of a NodeSet containing every match.

In case that td can contain nested markup around part of ANYWHERE, NY, you can modify the expression to omit the trailing /text() (so that you select only the td itself) and then use the text method to fetch the combined visible text inside it.
