Reputation: 47
I have code like this:
doc = Nokogiri::HTML.fragment(html)
doc.to_html
and an HTML fragment which will be parsed:
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
Nokogiri deletes the <html> and </html> tags inside the <code> block. How can I prevent this behavior?
UPDATE:
the Tin Man proposed a solution: pre-parse the HTML fragment and escape all HTML inside the code block.
Here is some code. It's not beautiful, so if you want to suggest another solution, please post a comment.
require 'cgi'
# Escape everything between <code> and </code> so it is parsed as text.
html.gsub!(/<code\b[^>]*>(.*?)<\/code>/m) do
  "<code>#{CGI.escapeHTML($1)}</code>"
end
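One limitation of the snippet above is that it rewrites the opening tag as a bare <code>, so any attributes on it are lost. A slightly more defensive sketch of the same idea follows, assuming the opening tag should be kept exactly as written. The name `escape_code_blocks` is a hypothetical helper, not part of CGI or Nokogiri.

```ruby
require 'cgi'

# Hypothetical helper: escapes the markup inside every <code>...</code>
# block so that a parser sees it as plain text. The opening tag is
# captured and re-emitted unchanged, so attributes such as class="..."
# survive. Returns a new string; the input is not modified.
def escape_code_blocks(html)
  html.gsub(%r{(<code\b[^>]*>)(.*?)</code>}m) do
    "#{$1}#{CGI.escapeHTML($2)}</code>"
  end
end

puts escape_code_blocks(%(<code class="x"><html><p>qwerty</p></html></code>))
# prints: <code class="x">&lt;html&gt;&lt;p&gt;qwerty&lt;/p&gt;&lt;/html&gt;</code>
```

Note that this escapes unconditionally, so content that is already entity-encoded would be encoded a second time (&lt; becomes &amp;lt;). It is only safe to run once, on raw input.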
Thanks, the Tin Man.
Upvotes: 1
Views: 292
Reputation: 160631
The problem is that the HTML is invalid. I used this to test it:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT
puts doc.errors
After parsing a document, Nokogiri populates the errors array with a list of the errors it found during parsing. In the case of your HTML, doc.errors contains:
htmlParseStartTag: misplaced <html> tag
The reason is that, inside the <code> block, the tags are not HTML encoded as they should be.
Convert it using HTML entities to:
&lt;html&gt;
&lt;p&gt;
qwerty
&lt;/p&gt;
&lt;/html&gt;
And it will work.
Nokogiri is an XML/HTML parser, and it attempts to fix errors in the markup to give you, the programmer, a good chance of using the document. In this case, because the <html> block is in the wrong place, it removes the tags. Nokogiri wouldn't care if the tags were encoded because, at that point, they're simply text, not tags.
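To see what the entity-encoded form looks like, here is a minimal sketch using CGI.escapeHTML from Ruby's standard library. The variable names `snippet` and `encoded` are only illustrative.

```ruby
require 'cgi'

# The markup that is meant to be displayed inside the <code> block.
snippet = "<html>\n<p>\nqwerty\n</p>\n</html>"

# escapeHTML replaces &, <, >, " and ' with entities, so a parser
# treats the result as text rather than as tags.
encoded = CGI.escapeHTML(snippet)
puts encoded

# unescapeHTML reverses the encoding and restores the original markup.
puts CGI.unescapeHTML(encoded) == snippet
```

The first `puts` prints the snippet with every `<` and `>` replaced by `&lt;` and `&gt;`, and the second prints `true`.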
EDIT:
I'll try to pre-parse with gsub and convert the HTML in the code block
require 'nokogiri'
html = <<EOT
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT
doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(%r[<(/?)html>], '&lt;\1html&gt;'))
puts doc.to_html
Which outputs:
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
&lt;html&gt;
<p>
qwerty
</p>
&lt;/html&gt;
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EDIT:
This processes the <html> tags prior to parsing, so Nokogiri can load the <code> block unscathed. It then finds the <code> block, unescapes the encoded <html> start and end tags, and inserts the resulting text into the <code> block as its content. Because it is inserted as content, when Nokogiri renders the DOM as HTML, the text is re-encoded as entities where necessary:
require 'cgi'
require 'nokogiri'
html = <<EOT
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
<html>
<p>
qwerty
</p>
</html>
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
EOT
doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(%r[<(/?)html>], '&lt;\1html&gt;'))
code = doc.at('code')
code.content = CGI::unescapeHTML(code.inner_html)
puts doc.to_html
Which outputs:
<p>some paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
<code>
&lt;html&gt;
&lt;p&gt;
qwerty
&lt;/p&gt;
&lt;/html&gt;
</code>
<p>some other paragraph</p>
<a href="https://url...com"><span style="color: #a5a5a5;"><i>qwerty</i></span> ytrewq </a>
Upvotes: 3