Ben
Reputation: 21249

Nokogiri to_xhtml puts doctype before <?xml

I'm trying to use Nokogiri to parse and update some XHTML files (fixing image sizes).

Parsing and updating work well, but when I save the document with:

doc.to_xhtml(:indent_text => "\t", :indent=>1, :encoding => 'UTF-8')

The first two lines change from (original):

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

to (output):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<?xml version="1.0" encoding="utf-8"??>

which isn't a valid XML document (there's also a doubled ? at the end of the XML declaration).

Am I doing something wrong?

Edit: I've got nokogiri (1.6.0) installed, which seems to be the latest version.

Upvotes: 2

Views: 567

Answers (1)

Jacob Brown
Reputation: 7561

This is an open (and very old) Nokogiri issue on GitHub, although the root cause may actually lie in libxml. I was able to replicate your output.

The quick fix is to parse your document with Nokogiri::XML rather than Nokogiri::HTML, which is probably better practice anyway when dealing with XHTML files:

require 'nokogiri'

doc = Nokogiri::XML(File.read('wherever'))
doc.to_xhtml(:indent_text => "\t", :indent=>1, :encoding => 'UTF-8')

Note that this won't preserve your XML declaration (the <?xml ...?> line). If you need it, use to_xml instead of to_xhtml.
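If you have to keep the HTML parser and to_xhtml for some reason, another option is to repair the serialized string afterwards. The following is only a rough sketch using plain Ruby string handling, not anything Nokogiri provides: fix_xml_prolog is a hypothetical helper name, and it assumes the output contains at most one <?xml ...?> line, as in the question.

```ruby
# Hypothetical helper (not part of Nokogiri): moves a misplaced XML
# declaration to the top of the serialized output and repairs "??>".
def fix_xml_prolog(xhtml)
  # Match the declaration, tolerating the doubled "?" from the broken output.
  decl_pattern = /<\?xml\b[^>]*?\?+>\s*/
  decl = xhtml[decl_pattern]
  return xhtml if decl.nil? # nothing to fix

  # Remove the declaration from wherever it landed, then normalise its ending.
  body = xhtml.sub(decl_pattern, '')
  clean_decl = decl.strip.sub(/\?+>\z/, '?>')
  "#{clean_decl}\n#{body}"
end

# Sample input mirroring the broken output shown in the question.
broken = <<~XHTML
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
  <?xml version="1.0" encoding="utf-8"??>
  <html xmlns="http://www.w3.org/1999/xhtml"></html>
XHTML

puts fix_xml_prolog(broken)
```

You would call it on the string returned by doc.to_xhtml(...) before writing the file. It is a workaround for the symptom only, so parsing with Nokogiri::XML is still the cleaner route when it works for you.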

Upvotes: 2
