perseverance

Reputation: 6602

How do you traverse an HTML document, search it, and skip to the next item using Nokogiri?

How do you traverse a document up to a certain found element and then continue to the next match? In my example I am trying to find the first <font> element, grab its text, and then continue until I find the next <font> tag or until I hit an <img> tag. I need to take the <img> tag into account as well because I want to do something when I reach it.

HTML

<table border=0>
  <tr> 
    <td width=180>
      <font size=+1><b>apple</b></font>
    </td>
    <td>Description of an apple</td>
  </tr>
  <tr> 
    <td width=180>
      <font size=+1><b>banana</b></font>
    </td>
    <td>Description of a banana</td>
  </tr>
  <tr> 
    <td><img vspace=4 hspace=0 src="common/dot_clear.gif"></td>
  </tr>
...Then this repeats itself in a similar format

Current scrape.rb

#...
document.at_css("body").traverse do |node|
  #if <font> is found 
    #puts text in font
  #else if <img> is found then 
    #puts img src and continue loop until end of document
end

Thank you!

Upvotes: 2

Views: 510

Answers (2)

kiddorails

Reputation: 13014

Interesting. You basically want to walk through all the descendants of a node in your tree and perform some operation depending on the kind of node you find.

So here is how we can do that:

# Acquiring a dummy page (URI.open comes from open-uri)
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(URI.open('http://en.wikipedia.org/wiki/Ruby_%28programming_language%29'))

Now, if you want to visit every element inside the body, XPath comes to the rescue. The XPath expression //body//* matches all element descendants of body: children, grandchildren, and so on.

This returns a Nokogiri::XML::NodeSet, an array-like collection whose members are Nokogiri::XML::Element objects:

page.xpath('//body//*')
page.xpath('//body//*').first.node_name
#=> "div"

So, you can now iterate over that node set and perform your operations:

page.xpath('//body//*').each do |node|
  case node.name
    when 'div' then #do this 
    when 'font' then #do that
  end
end

Upvotes: 1

Vidya

Reputation: 30300

Something like this perhaps:

document.at_css("body").traverse do |node|
  if node.name == 'font'
    puts node.content
  elsif node.name == 'img'
    puts node['src']
  end
end

Upvotes: 0
