Nokogiri HTML Nested Elements Extract Class and Text

Question

I have a basic page structure with elements (spans) nested under other elements (divs and spans). Here's an example:

html = "
  
    
         
      
         Plains
          Trains
           Automobiles
      
      
        
          Love
           First
            Sight
      
    
  
"

Note that the class names are random, and that the HTML contains spaces, newlines and tabs between the elements.

I want to extract the children into a hash that maps each child's class to its text. This is my attempt:

page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
  children = divs.children
  children.each do |child|
    itemhash[child['class']] = child.text
  end
end

Result should be similar to:

 {"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}

But I'm ending up with a mess like this:

 {nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}

This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated, so I'm trying to work around the issue. I've tried noblanks, but that's not working. I've also tried gsub, but that only destroys my markup.

How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?

P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
