Jarsen

Reputation: 7572

How do I pretty-print HTML with Nokogiri?

I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it wants.

My crawler is caching the HTML of the webpages and writing it to files on my local machine. I would like to "pretty print" the HTML so that it looks nice and properly formatted when I do so.

Upvotes: 33

Views: 33244

Answers (8)

Dorian

Reputation: 9065

Simpler, and works well:

puts Nokogiri::HTML(File.read('terms.fr.html')).to_xhtml

Upvotes: 1

Abeid Ahmed

Reputation: 355

I know I'm extremely late to this question, but I'll leave an answer anyway. I tried all of the approaches above, and they do work to an extent.

Nokogiri does format the HTML, but it pays no attention to how opening and closing tags line up, so the result is not truly pretty-printed.

I found a gem called htmlbeautifier that works like a charm. I hope other people who are still searching for the answer will find this valuable.

Upvotes: 1

mislav

Reputation: 15199

By "pretty printing" an HTML page, I presume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library, and its output is useful for debugging only.
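
As for the parameter the question asks about: pretty_print is a hook that pp calls, and the argument is the printer object pp passes in. Here is a minimal sketch using only the standard library. FakeNode is a made-up class for illustration, not part of Nokogiri:

```ruby
require 'pp'

# Hypothetical stand-in for a parsed node; NOT a Nokogiri class.
class FakeNode
  def initialize(name, children = [])
    @name = name
    @children = children
  end

  # pp calls this hook and passes in its printer object (conventionally q).
  def pretty_print(q)
    q.group(2, "#<FakeNode #{@name}", ">") do
      @children.each do |child|
        q.breakable   # a space, or a line break if the line gets too long
        q.pp(child)   # recurse through the same printer
      end
    end
  end
end

tree = FakeNode.new("section", [FakeNode.new("h1"), FakeNode.new("p")])

# PP.pp writes into the object given as second argument and returns it.
dump = PP.pp(tree, +"")
puts dump
```

So you normally never call pretty_print yourself; you call pp and it supplies the argument. The result is a debugging dump of the object, not reformatted markup.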

There are several projects that understand HTML well enough to reformat it without destroying whitespace that is actually significant (the best known is HTML Tidy). By Googling, I also found a post titled "Pretty printing XHTML with Nokogiri and XSLT".

It comes down to this:

xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s

It requires you, of course, to download the linked XSL file to your filesystem. I've tried it very quickly on my machine and it works like a charm.

Upvotes: 20

Phrogz

Reputation: 303178

The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you:

  • Parse the document as XML
  • Instruct Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
  • Use to_xhtml or to_xml to specify pretty-printing parameters

In action:

html = '<section>
<h1>Main Section 1</h1><p>Intro</p>
<section>
<h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
</section><section>
<h2>Subhead 1.2</h2><p>Meat</p>
</section></section>'

require 'nokogiri'
doc = Nokogiri::XML(html, &:noblanks)
puts doc
#=> <section>
#=>   <h1>Main Section 1</h1>
#=>   <p>Intro</p>
#=>   <section>
#=>     <h2>Subhead 1.1</h2>
#=>     <p>Meat</p>
#=>     <p>MOAR MEAT</p>
#=>   </section>
#=>   <section>
#=>     <h2>Subhead 1.2</h2>
#=>     <p>Meat</p>
#=>   </section>
#=> </section>

puts doc.to_xhtml(indent: 3, indent_text: ".")
#=> <section>
#=> ...<h1>Main Section 1</h1>
#=> ...<p>Intro</p>
#=> ...<section>
#=> ......<h2>Subhead 1.1</h2>
#=> ......<p>Meat</p>
#=> ......<p>MOAR MEAT</p>
#=> ...</section>
#=> ...<section>
#=> ......<h2>Subhead 1.2</h2>
#=> ......<p>Meat</p>
#=> ...</section>
#=> </section>

Upvotes: 86

pariser

Reputation: 275

My solution was to add a print method onto the actual Nokogiri objects. After you run the code in the snippet below, you should just be able to write node.print, and it'll pretty print the contents. No xslt required :-)

Nokogiri::XML::Node.class_eval do
  # Print every Node by default (will be overridden by CharacterData)
  define_method :should_print? do
    true
  end

  # Duplicate this node, replace the contents of the duplicated node with a
  # newline. With this content substitution, the #to_s method conveniently
  # returns a string with the opening tag (e.g. `<a href="foo">`) on the first
  # line and the closing tag on the second (e.g. `</a>`, provided that the
  # current node is not a self-closing tag).
  #
  # Now, print the open tag preceded by the correct amount of indentation, then
  # recursively print this node's children (with extra indentation), and then
  # print the close tag (if there is a closing tag)
  define_method :print do |indent=0|
    duplicate = self.dup
    duplicate.content = "\n"
    open_tag, close_tag = duplicate.to_s.split("\n")

    puts (" " * indent) + open_tag
    self.children.select(&:should_print?).each { |child| child.print(indent + 2) }
    puts (" " * indent) + close_tag if close_tag
  end
end

Nokogiri::XML::CharacterData.class_eval do
  # Only print CharacterData if there's non-whitespace content
  define_method :should_print? do
    content =~ /\S+/
  end

  # Replace all consecutive whitespace characters by a single space; precede the
  # output by a certain amount of indentation; print this text.
  define_method :print do |indent=0|
    puts (" " * indent) + to_s.strip.gsub(/\s+/, ' ')
  end
  end
end

Upvotes: 2

bronson

Reputation: 5992

This worked for me:

pretty_html = Nokogiri::HTML(html).to_xhtml(indent: 3)

I tried the REXML version above, but it corrupted some of my documents. And I hate to bring xslt into a new project. Both feel antiquated. :)

Upvotes: 9

Julien

Reputation: 1037

You can try REXML:

require "rexml/document"

doc = REXML::Document.new(xml)
doc.write($stdout, 2)
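
Keep in mind that REXML is an XML parser, so the input has to be well-formed XML; real-world HTML often is not. Here is a self-contained sketch that writes to a string instead of $stdout. The sample markup and the well_formed? helper are made up for illustration:

```ruby
require "rexml/document"

# Made-up, well-formed sample input.
xml = "<section><h1>Title</h1><p>Intro</p></section>"

doc = REXML::Document.new(xml)
out = +""          # any object that responds to << can receive the output
doc.write(out, 2)  # second argument is the indent width
puts out

# Hypothetical helper: reports whether REXML can parse the markup at all.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts well_formed?(xml)
puts well_formed?("<p>unclosed paragraph")
```

Checking well_formed? first is one way to decide whether to fall back to an HTML-aware parser for a given page.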

Upvotes: 4

khelll

Reputation: 23990

Why don't you try the pp method?

require 'pp'
pp some_var

Upvotes: -6
