tmhumble

Reputation: 5

How to further process a Nokogiri::XML::Element?

I've written a short script in Ruby using Nokogiri to extract some data from a web page. The script works fine, but it is currently returning several nested tags as a single Nokogiri::XML::Element.

The script is as follows:

require 'rubygems'
require 'nokogiri'

# Dummy content that mimics the structure of the web page
dummy_content = '<div id="div_saadi"><div><div style="padding:10px 0"><span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span</div></div></div>'
page = Nokogiri::HTML(dummy_content)

# Grab the second div nested inside the div with id "div_saadi"
result = page.css('div#div_saadi div')[1]

puts result
puts result.class

The output is as follows:

<div style="padding:10px 0">
<span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span
</div>
Nokogiri::XML::Element

What I'd like to do is to produce the following output (using something like .each):

content
content outside of the span
morecontent
morecontent outside of the span

Upvotes: 0

Views: 74

Answers (1)

the Tin Man

Reputation: 160581

You're getting close, but aren't understanding what you're getting back.

An element can contain other nodes, both tags and text. That's what's happening here: you're asking for a single node, but that node contains additional child nodes:

puts page.css('div#div_saadi div')[1].to_html
# >> <div style="padding:10px 0">
# >> <span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span</div>

text works on both a NodeSet and a Node. It returns the text of whatever you call it on, descending through every nested level and concatenating everything it finds into a single string:

result = page.css('div#div_saadi div')[1].text
# => "contentcontent outside of the spanmorecontentmorecontent outside of the span"

Instead, to get each piece separately, you have to iterate over the node's children and extract the text of each one:

require 'nokogiri'

dummy_content = '<div id="div_saadi"><div><div style="padding:10px 0"><span class="t4">content</span>content outside of the span<span class="t2">morecontent</span>morecontent outside of the span</div></div></div>'
page = Nokogiri::HTML(dummy_content)

result = page.css('div#div_saadi div')[1]
puts result.children.map(&:text)

# >> content
# >> content outside of the span
# >> morecontent
# >> morecontent outside of the span

children returns the node's direct child nodes, including the text nodes that sit between the spans, as a NodeSet. Iterating over it yields individual Nodes, and calling text on each one returns just that node's text.

Upvotes: 2
