Florian Rivoal

Reputation: 501

Disappearing entities in XML fragment with Nokogiri

I'm using Nokogiri to process fragments of XHTML documents, and I'm running into behavior I can neither explain nor work around. I'm not sure whether it's a bug or something I don't understand.

Consider the following two lines, showcasing a reduced version of the problem I'm running into:

puts Nokogiri::XML::DocumentFragment.parse("&nbsp;<pre>&lt;div>foo&lt;/div></pre>")
puts Nokogiri::XML::DocumentFragment.parse("<pre>&lt;div>foo&lt;/div></pre>")

This is the output:

<pre>div&gt;foo/div&gt;</pre>
<pre>&lt;div&gt;foo&lt;/div&gt;</pre>

The second line is what I expect, but the first one puzzles me. Where did the &nbsp; go? Why does its presence cause the &lt; to disappear?

Upvotes: 1

Views: 352

Answers (1)

Florian Rivoal

Reputation: 501

Following matt's suggestion, I now wrap the fragment in a full XHTML document before parsing it, as that allows Nokogiri to know about the XHTML entities.

require "nokogiri"

fragment = "&nbsp;<pre>&lt;div>foo&lt;/div></pre>"
head = <<HERE
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta charset="UTF-8" />
</head>
<body>
HERE

foot = <<HERE
</body>
</html>
HERE

puts Nokogiri::XML.parse(head + fragment + foot).css("body").children.to_xml

It feels a bit heavy-handed, but it works.
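A lighter alternative, which is not what I ended up using, might be to rewrite the XHTML named entities as numeric character references before parsing, since XML parsers accept those without a DTD. The sketch below is a minimal illustration under that assumption: `numeric_entities` and `XHTML_ENTITIES` are hypothetical names of my own, not part of Nokogiri, and the table covers only a handful of entities.

```ruby
# Hypothetical lookup table (an assumption, deliberately incomplete):
# a few XHTML named entities and their Unicode code points.
XHTML_ENTITIES = {
  "nbsp"   => 160,
  "copy"   => 169,
  "mdash"  => 8212,
  "hellip" => 8230
}.freeze

# Hypothetical helper: replaces known named entities with numeric character
# references. Anything not in the table, including the predefined XML
# entities such as &lt; and &amp;, is left untouched.
def numeric_entities(fragment)
  fragment.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    code = XHTML_ENTITIES[Regexp.last_match(1)]
    code ? format("&#%d;", code) : match
  end
end

fragment = "&nbsp;<pre>&lt;div>foo&lt;/div></pre>"
puts numeric_entities(fragment)
```

The converted string can then be handed to Nokogiri::XML::DocumentFragment.parse as in the question. I have not checked this against every entity in the XHTML DTDs, so the table would need extending for real documents.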

Upvotes: 1
