t56k

Reputation: 6981

Why is Nokogiri giving me multiple results?

I'm trying to parse an HTML string with Nokogiri but I'm getting some recursion issue and I can't figure out why.

Given these commands:

require 'nokogiri'

# The HTML must be a Ruby string; a heredoc keeps the markup readable.
string = <<~HTML
  <h3>Lancers were arranged.&nbsp;</h3>
  <div>Gabriel found himself partnered with Miss Ivors.</div>
  <br>She leaned. He lit a <b>candle</b>.
  They followed him in silence, their feet falling in soft thuds on the thickly carpeted stairs.<br>
HTML

body = Nokogiri::HTML(string)
result = []
body.traverse { |node| result << node }

I would expect an array of the above elements. Instead I get this:

[#<Nokogiri::XML::DTD:0x3fde1f3d5274 name="html">
#<Nokogiri::XML::Text:0x3fde1e88d330 "Lancers were arranged. ">
#<Nokogiri::XML::Element:0x3fde1ea56a68 name="h3" children=[#<Nokogiri::XML::Text:0x3fde1e88d330 "Lancers were arranged. ">]>
#<Nokogiri::XML::Text:0x3fde1e88c764 "Gabriel found himself partnered with Miss Ivors.">
#<Nokogiri::XML::Element:0x3fde1e88cd04 name="div" children=[#<Nokogiri::XML::Text:0x3fde1e88c764 "Gabriel found himself partnered with Miss Ivors.">]>
#<Nokogiri::XML::Element:0x3fde1e88c0fc name="br">
#<Nokogiri::XML::Text:0x3fde1e88b9e0 "She leaned. He lit a ">
#<Nokogiri::XML::Text:0x3fde1eba6c60 "candle">
#<Nokogiri::XML::Element:0x3fde1e88b5f8 name="b" children=[#<Nokogiri::XML::Text:0x3fde1eba6c60 "candle">]>
#<Nokogiri::XML::Text:0x3fde1eba6454 ". They followed him in silence, their feet falling in soft thuds on the thickly carpeted stairs.">
#<Nokogiri::XML::Element:0x3fde1eba5f54 name="br">
#<Nokogiri::XML::Element:0x3fde1ea56f7c name="body" children=[#<Nokogiri::XML::Element:0x3fde1ea56a68 name="h3" children=[#<Nokogiri::XML::Text:0x3fde1e88d330 "Lancers were arranged. ">]>
#<Nokogiri::XML::Element:0x3fde1e88cd04 name="div" children=[#<Nokogiri::XML::Text:0x3fde1e88c764 "Gabriel found himself partnered with Miss Ivors.">]>
#<Nokogiri::XML::Element:0x3fde1e88c0fc name="br">
#<Nokogiri::XML::Text:0x3fde1e88b9e0 "She leaned. He lit a ">
#<Nokogiri::XML::Element:0x3fde1e88b5f8 name="b" children=[#<Nokogiri::XML::Text:0x3fde1eba6c60 "candle">]>
#<Nokogiri::XML::Text:0x3fde1eba6454 ". They followed him in silence, their feet falling in soft thuds on the thickly carpeted stairs.">
#<Nokogiri::XML::Element:0x3fde1eba5f54 name="br">]>
#<Nokogiri::XML::Element:0x3fde1ea575e4 name="html" children=[#<Nokogiri::XML::Element:0x3fde1ea56f7c name="body" children=[#<Nokogiri::XML::Element:0x3fde1ea56a68 name="h3" children=[#<Nokogiri::XML::Text:0x3fde1e88d330 "Lancers were arranged. ">]>
#<Nokogiri::XML::Element:0x3fde1e88cd04 name="div" children=[#<Nokogiri::XML::Text:0x3fde1e88c764 "Gabriel found himself partnered with Miss Ivors.">]>
#<Nokogiri::XML::Element:0x3fde1e88c0fc name="br">
#<Nokogiri::XML::Text:0x3fde1e88b9e0 "She leaned. He lit a ">
#<Nokogiri::XML::Element:0x3fde1e88b5f8 name="b" children=[#<Nokogiri::XML::Text:0x3fde1eba6c60 "candle">]>
#<Nokogiri::XML::Text:0x3fde1eba6454 ". They followed him in silence, their feet falling in soft thuds on the thickly carpeted stairs.">
#<Nokogiri::XML::Element:0x3fde1eba5f54 name="br">]>]>
#<Nokogiri::HTML::Document:0x3fde1f3d6084 name="document" children=[#<Nokogiri::XML::DTD:0x3fde1f3d5274 name="html">
#<Nokogiri::XML::Element:0x3fde1ea575e4 name="html" children=[#<Nokogiri::XML::Element:0x3fde1ea56f7c name="body" children=[#<Nokogiri::XML::Element:0x3fde1ea56a68 name="h3" children=[#<Nokogiri::XML::Text:0x3fde1e88d330 "Lancers were arranged. ">]>
#<Nokogiri::XML::Element:0x3fde1e88cd04 name="div" children=[#<Nokogiri::XML::Text:0x3fde1e88c764 "Gabriel found himself partnered with Miss Ivors.">]>
#<Nokogiri::XML::Element:0x3fde1e88c0fc name="br">
#<Nokogiri::XML::Text:0x3fde1e88b9e0 "She leaned. He lit a ">
#<Nokogiri::XML::Element:0x3fde1e88b5f8 name="b" children=[#<Nokogiri::XML::Text:0x3fde1eba6c60 "candle">]>
#<Nokogiri::XML::Text:0x3fde1eba6454 ". They followed him in silence, their feet falling in soft thuds on the thickly carpeted stairs.">
#<Nokogiri::XML::Element:0x3fde1eba5f54 name="br">]>]>]>] 

Sorry for the length. Can anyone help me figure out why this happens? And/or how to prevent it?

Upvotes: 2

Views: 235

Answers (2)

mechanicalfish

Reputation: 12826

When you parse incomplete HTML (a fragment rather than a full document), Nokogiri automatically adds a doctype and wraps the content in html and body elements. To avoid this behavior, parse the string as a fragment:

body = Nokogiri::HTML::DocumentFragment.parse(your_html)

If you want result to be an array of elements, excluding text nodes, you can do this:

result = body.xpath('./*')

Then the result (converted to string for clarity) will be:

["<h3>Lancers were arranged. </h3>",
 "<div>Gabriel found himself partnered with Miss Ivors.</div>",
 "<br>",
 "<b>candle</b>",
 "<br>"]

Upvotes: 2

Alex.Bullard

Reputation: 5563

This happens because traverse calls the provided block on the node itself and on all of its descendants, recursively. So it adds every node of your HTML string to the result array instead of just the top-level nodes. The 'multiple results' you are seeing come from how inspect is defined for Nokogiri nodes: each node prints itself together with all of its children. For example, the 3rd element in the returned array represents the h3 node, but its inspect output also prints its children, which include the text node that is the 2nd element of the array.
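To make the effect concrete without Nokogiri, here is a small plain-Ruby sketch. SketchNode is a hypothetical stand-in, not Nokogiri's actual class; it only mimics a children-first traversal and an inspect that nests children.

```ruby
# Hypothetical stand-in for a parsed node; not Nokogiri's actual class.
SketchNode = Struct.new(:name, :children) do
  # Visit the children first, then the node itself (post-order).
  def traverse(&block)
    children.each { |child| child.traverse(&block) }
    block.call(self)
  end

  # Print the node together with its nested children.
  def inspect
    children.empty? ? "#<#{name}>" : "#<#{name} children=#{children.inspect}>"
  end
end

text = SketchNode.new('text', [])
h3   = SketchNode.new('h3', [text])
body = SketchNode.new('body', [h3])

visited = []
body.traverse { |node| visited << node }

puts visited.size     # each of the 3 nodes is visited exactly once
puts visited.inspect  # but the text node is printed 3 times, once per ancestor level
```

The array holds three distinct nodes; the apparent duplication exists only in the printed representation.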

If you want result to include a reference to every node in the document, then this is the right way to do it. If you just want the top-level nodes, use children.

Upvotes: 5
