Nick Faraday

Reputation: 568

How do I search for "text" then traverse the DOM from the found node?

I have a webpage that I need to scrape some data from. The problem is that each page may or may not have the specific data, or it may have extra data above or below it in the DOM, and there are no CSS ids to speak of.

Typically I could use either CSS ids or XPath to get to the node I'm looking for. I don't have that option in this case. What I'm trying to do is search for the "label" text then grab the data in the next <TD> node:

<tr> 
    <td><b>Name:</b></td> 
    <td>Joe Smith <small><a href="/Joe"><img src="/joe.png"></a></small></td> 
</tr>

In the above HTML, I would search for:

doc.search("[text()*='Name:']")

to get the node just before the data I need, but I'm not sure how to navigate from there.

Upvotes: 16

Views: 8900

Answers (3)

fearless_fool

Reputation: 35159

You can do the entire search in a single statement using XPath's parent step (..) and the following-sibling axis:

>> require 'nokogiri' 
=> true   
>> html = <<HTML
<tr> 
    <td><b>Name:</b></td> 
    <td>Joe Smith <small><a href="/Joe"><img src="/joe.png"></a></small></td> 
</tr>
HTML
>> doc = Nokogiri::HTML(html)

>> doc.at_xpath("//*[text()='Name:']/../following-sibling::*").to_s
=> "<td>Joe Smith <small><a href=\"/Joe\"><img src=\"/joe.png\"></a></small>\n</td>"

Upvotes: 0

the Tin Man

Reputation: 160551

require 'nokogiri'

html = '
<html>
  <body>
    <p>foo</p>
    this text
    <p>bar</p>
  </body>
</html>
'

doc = Nokogiri::HTML(html)
# next_sibling returns the node right after the matched <p>, which here is a bare text node.
doc.at('p:contains("foo")').next_sibling.text.strip
# => "this text"

Upvotes: 2

Michelle Tilley

Reputation: 159105

next_element is probably the method you're looking for.

require 'nokogiri'

data = File.read "html.htm"

doc  = Nokogiri::HTML data

els  = doc.search "[text()*='Name:']"
el   = els.first

puts "Found element:"
puts el
puts

puts "Parent element:"
puts el.parent
puts

puts "Parent's next_element():"
puts el.parent.next_element

# Output:
#
# Found element:
# <b>Name:</b>
#
# Parent element:
# <td> 
#     <b>Name:</b>
# </td>
#
# Parent's next_element():
# <td>Joe Smith <small><a href="/Joe"><img src="/joe.png"></a></small>
# </td>

Note that since the text is inside <b></b> tags, you have to go up a level (to the found element's parent <td>) before you can get to the next sibling. If the HTML structure isn't stable, you'd have to find the first parent that is a <td> and go from there.

Upvotes: 25
