Reputation: 1085

Nokogiri Ruby - Remove the <!DOCTYPE ... > from the output html

I am parsing an html file using nokogiri and modifying it and then outputting it to a file like this:

htext= File.open(inputOpts.html_file).read
h_doc = Nokogiri::HTML(htext)
File.open(outputfile, 'w+')  do |file|
  file.write(h_doc)
end

The output file contains the first line as:

 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">

I do not want this because I am embedding the html in a different file and this tag is causing issues.

How do I remove this from h_doc?

Upvotes: 5

Answers (4)

John

Reputation: 9456

I managed to work around this with both an HTML::Document and an HTML::DocumentFragment.

For background, I'm using Nokogiri to parse and modify "templates", "partials" and/or "components" of HTML. This means that the files I encounter are not valid HTML documents. They are, instead, pieces of an HTML document that gets put together by the framework I'm using.
For reference, HTML::Document adds the <!DOCTYPE> declaration and also wraps your document in <html> and <body> elements if they are not already present. Similarly, in my experience HTML::DocumentFragment will wrap your fragment in a <p> element.

Rather than spending too much time digging into the Nokogiri library code to understand where these additional elements were being added, I decided to accept this opinionated implementation and work around it.

Solution

Here's how I write out my modified HTML:

html_str = doc.xpath("//body").children.to_html(encoding: 'UTF-8')
File.open(_filename, 'w') {|f| f.write(html_str)}

Final Word

This seems harder than it should be. I even tried using the SaveOptions setting save_with: Nokogiri::XML::Node::SaveOptions::NO_DECLARATION to no avail.

In any case, while this solution is a bit kludgey for my liking, it works.

Upvotes: 0

matt

Reputation: 79723

Depending on what you are trying to do, you could parse your HTML as a DocumentFragment:

h_doc = Nokogiri::HTML::DocumentFragment.parse(htext)

When calling to_s or to_html on a fragment the doctype line will be omitted, as will the <html> and <body> tags that Nokogiri adds if they aren’t already present.

Upvotes: 8

engineersmnky

Reputation: 29318

It depends on your needs. If all you need is the body then

h_doc.at_xpath("//body") #this will just pull the data from the <body></body> tags

If you need to collect the <head> too and just avoid the <!DOCTYPE> then

#this will capture the <head> and <body> elements (and everything inside them), but not the doctype
h_doc.xpath("//head") + h_doc.xpath("//body")

So something like this

h_doc = Nokogiri::HTML(open(input_opts.html_file))
File.open(outputfile,'w+') do |file|
  #for just <body>
  file << h_doc.at_xpath("//body").to_s
  #for <head> and <body>
  file << (h_doc.xpath("//head") + h_doc.xpath("//body")).to_s
end

Notice that for the body I used #at_xpath, as this returns a single Nokogiri::XML::Element, but when combining them I used #xpath because this returns a Nokogiri::XML::NodeSet. No need to worry: this is only needed for the combination, and the HTML will come out the same, e.g. h_doc.at_xpath("//head").to_s == h_doc.xpath("//head").to_s #=> true

Upvotes: 3

Anthony

Reputation: 15957

You can just ignore the first line when reading the input file:

htext= File.readlines(inputOpts.html_file)[1..-1].join
h_doc = Nokogiri::HTML(htext)
File.open(outputfile, 'w+')  do |file|
  file.write(h_doc)
end

Upvotes: 0
