Reputation: 7572
I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it wants.
My crawler is caching the HTML of the webpages and writing it to files on my local machine. I would like to "pretty print" the HTML so that it looks nice and properly formatted when I do so.
Upvotes: 33
Views: 33244
Reputation: 9065
Simpler, and it works well:
puts Nokogiri::HTML(File.read('terms.fr.html')).to_xhtml
Upvotes: 1
Reputation: 355
I know I am extremely late to answer this question, but I'll leave an answer anyway. I tried all the steps above and they do work to an extent. Nokogiri does format the HTML, but it does not take the opening and closing tags into account, so a properly pretty-printed result is out of the picture.
I found a gem called htmlbeautifier that works like a charm. I hope other people who are still searching for the answer will find this valuable.
Upvotes: 1
Reputation: 15199
By "pretty printing" an HTML page I presume you mean that you want to reformat the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library and its output is useful for debugging only.
There are several projects that understand HTML well enough to be able to reformat it without destroying whitespace that is actually significant (the famous one is HTML Tidy), but by Googling I've found this post titled "Pretty printing XHTML with Nokogiri and XSLT".
It comes down to this:
xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s
It requires you, of course, to download the linked XSL file to your filesystem. I've tried it very quickly on my machine and it works like a charm.
Upvotes: 20
Reputation: 303178
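The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you parse the document as XML, tell Nokogiri to ignore whitespace-only nodes ("blanks") during parsing, and use to_xhtml or to_xml to specify pretty-printing parameters. In action: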
html = '<section>
<h1>Main Section 1</h1><p>Intro</p>
<section>
<h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
</section><section>
<h2>Subhead 1.2</h2><p>Meat</p>
</section></section>'
require 'nokogiri'
doc = Nokogiri::XML(html,&:noblanks)
puts doc
#=> <section>
#=>   <h1>Main Section 1</h1>
#=>   <p>Intro</p>
#=>   <section>
#=>     <h2>Subhead 1.1</h2>
#=>     <p>Meat</p>
#=>     <p>MOAR MEAT</p>
#=>   </section>
#=>   <section>
#=>     <h2>Subhead 1.2</h2>
#=>     <p>Meat</p>
#=>   </section>
#=> </section>
puts doc.to_xhtml( indent:3, indent_text:"." )
#=> <section>
#=> ...<h1>Main Section 1</h1>
#=> ...<p>Intro</p>
#=> ...<section>
#=> ......<h2>Subhead 1.1</h2>
#=> ......<p>Meat</p>
#=> ......<p>MOAR MEAT</p>
#=> ...</section>
#=> ...<section>
#=> ......<h2>Subhead 1.2</h2>
#=> ......<p>Meat</p>
#=> ...</section>
#=> </section>
Upvotes: 86
Reputation: 275
My solution was to add a print method onto the actual Nokogiri objects. After you run the code in the snippet below, you should just be able to write node.print, and it'll pretty print the contents. No xslt required :-)
Nokogiri::XML::Node.class_eval do
  # Print every Node by default (will be overridden by CharacterData)
  define_method :should_print? do
    true
  end

  # Duplicate this node, replace the contents of the duplicated node with a
  # newline. With this content substitution, the #to_s method conveniently
  # returns a string with the opening tag (e.g. `<a href="foo">`) on the first
  # line and the closing tag on the second (e.g. `</a>`, provided that the
  # current node is not a self-closing tag).
  #
  # Now, print the open tag preceded by the correct amount of indentation, then
  # recursively print this node's children (with extra indentation), and then
  # print the close tag (if there is a closing tag)
  define_method :print do |indent=0|
    duplicate = self.dup
    duplicate.content = "\n"
    open_tag, close_tag = duplicate.to_s.split("\n")
    puts (" " * indent) + open_tag
    self.children.select(&:should_print?).each { |child| child.print(indent + 2) }
    puts (" " * indent) + close_tag if close_tag
  end
end
Nokogiri::XML::CharacterData.class_eval do
  # Only print CharacterData if there's non-whitespace content
  define_method :should_print? do
    content =~ /\S+/
  end

  # Replace every run of consecutive whitespace characters by a single space
  # (gsub, so that all runs are replaced, not just the first); precede the
  # output by a certain amount of indentation; print this text.
  define_method :print do |indent=0|
    puts (" " * indent) + to_s.strip.gsub(/\s+/, ' ')
  end
end
Upvotes: 2
Reputation: 5992
This worked for me:
pretty_html = Nokogiri::HTML(html).to_xhtml(indent: 3)
I tried the REXML version above, but it corrupted some of my documents. And I hate to bring xslt into a new project. Both feel antiquated. :)
Upvotes: 9
Reputation: 1037
You can try REXML:
require "rexml/document"
doc = REXML::Document.new(xml)
doc.write($stdout, 2)
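To capture the result in a string instead of printing it, the same call can write into a StringIO. A small sketch, assuming well-formed XML input; pretty_with_rexml is a hypothetical helper name. Be aware that REXML's pretty formatter may move text content onto its own lines, so the output is not whitespace-identical to the input.

```ruby
require 'rexml/document'
require 'stringio'

# Hypothetical helper: parses well-formed XML and returns it re-indented.
def pretty_with_rexml(xml, indent = 2)
  doc = REXML::Document.new(xml)
  out = StringIO.new
  doc.write(out, indent)
  out.string
end

puts pretty_with_rexml('<section><h1>Title</h1><p>Intro</p></section>')
```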
Upvotes: 4