Reputation: 2199
I have a basic page structure with elements (spans) nested under other elements (divs and spans). Here's an example:
html = "<html>
<body>
<div class="item">
<div class="profile">
<span class="itemize">
<div class="r12321">Plains</div>
<div class="as124223">Trains</div>
<div class="qwss12311232">Automobiles</div>
</div>
<div class="profile">
<span class="itemize">
<div class="lknoijojkljl98799999">Love</div>
<div class="vssdfsd0809809">First</div>
<div class="awefsaf98098">Sight</div>
</div>
</div>
</body>
</html>"
Notice that the class names are random, and that the HTML contains whitespace and tabs.
I want to extract the children and end up with a hash like so:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
children = divs.children
children.each do |child|
itemhash[child['class']] = child.text
end
end
Result should be similar to:
{"r12321"=>"Plains", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", "lknoijojkljl98799999"=>"Love", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
But I'm ending up with a mess like this:
{nil=>"\n\t\t\t\t\t\t", "r12321"=>"Plains", nil=>" ", "as124223"=>"Trains", "qwss12311232"=>"Automobiles", nil=>"\n\t\t\t\t\t\t", "lknoijojkljl98799999"=>"Love", nil=>" ", "vssdfsd0809809"=>"First", "awefsaf98098"=>"Sight"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
Upvotes: 0
Views: 1076
Reputation: 79733
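The children method returns all child nodes, including text nodes, even when those text nodes contain only whitespace. That is where the nil keys come from: a text node has no class attribute.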
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which would be closer to what you are already doing, and which only returns children that are elements:
children = divs.element_children
Upvotes: 1