Nicholas John Martin

Reputation: 307

Ruby Nokogiri::HTML::Builder Encoding Problems

We are using Nokogiri to create HTML5 pages based on user inputs and we are seeing some strange encoding issues.

In our database table we have an attribute called compiled_html which contains:

<p class="lead align-left">Just testing out some encoding issues:<br><br>Héllo Äre Thésè symbols showing correctly? </p>

After pulling this HTML snippet from our DB and creating a new page Nokogiri outputs:

<p class="lead align-left">Just testing out some encoding issues:<br><br>Héllo Ãre Thésè symbols showing correctly? </p>
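
For readers unfamiliar with this kind of output: a stray "Ã" in place of "Ä" is what typically appears when the UTF-8 bytes of a character are read back as ISO-8859-1. The sketch below only reproduces that symptom in plain Ruby; it is not a diagnosis of where Nokogiri or libxml goes wrong, and the helper name misread_as_latin1 is made up for illustration.

```ruby
# Illustration only: reinterpret the UTF-8 bytes of a string as ISO-8859-1.
# `misread_as_latin1` is a hypothetical helper, not part of Nokogiri.
def misread_as_latin1(utf8_text)
  # dup leaves the caller's string untouched; force_encoding relabels the
  # bytes without changing them; encode converts the relabelled text to UTF-8.
  utf8_text.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

garbled = misread_as_latin1("\u00C4") # "\u00C4" is "Ä"
puts garbled.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
# prints: U+00C3 U+0084
```

U+00C3 is "Ã" and U+0084 is an invisible control character, which would explain why only "Ã" is visible in output like the above.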

This is how we are pulling the compiled_html data and using Nokogiri:

page_doc = Nokogiri::HTML::fragment(page.compiled_html)

(Side note: when we output page_doc right after parsing it, the HTML is correct.)

    # create page html
    builder = Nokogiri::HTML::Builder.new(:encoding => 'UTF-8') do |doc|
      doc.html {
        doc.head {
          doc.title page.name
          doc.meta(charset: 'utf-8')
          doc.meta(name: 'viewport', content: 'width=device-width, initial-scale=1.0')
          doc.meta(name: 'description', content: '')
          doc.meta(name: 'author', content: "#{issue.publication.user.firstname} #{issue.publication.user.lastname}")
          doc.link(rel: 'stylesheet', href: "themes/#{theme.identifier}/theme.css")
          doc.script(type: 'text/javascript', src: "themes/#{theme.identifier}/theme.js")
        }
        doc.body {
          doc << page_doc
        }
      }
    end

We have tried setting the encoding to utf-8 in different ways, but no matter what we try we still get the weird symbols in our output.

This is for a Ruby on Rails 4 app.

Any ideas? Thanks!

Update: If I change:

    doc.body {
      doc << page_doc
    }

To this:

    doc.body {
      doc.text page_doc
    }

Then the character encoding is correct, but the HTML markup is escaped, so I get

&lt; 

instead of

< 

etc.
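
The escaping is what you would expect when markup is inserted as character data rather than as nodes. As a rough illustration that does not use Nokogiri at all, the standard library's ERB::Util.html_escape produces the same kind of result; the wrapper name as_text is made up for this sketch.

```ruby
require "erb"

# Stand-in for inserting markup as text: the special characters are
# escaped, so a browser shows the tag literally instead of rendering it.
# `as_text` is a hypothetical name used only for this illustration.
def as_text(markup)
  ERB::Util.html_escape(markup)
end

puts as_text("<br>")
# prints: &lt;br&gt;
```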

Upvotes: 2

Views: 641

Answers (1)

Nicholas John Martin

Reputation: 307

Okay, so we figured out that the problem is actually caused by the version of libxml installed on Heroku. See the related question "Nokogiri adds characters during parsing on Heroku".

My brilliant programmer came up with a quick fix that solved our problem:

    doc.body {
      # doc << page_doc
    }

    # after the builder block has finished: insert html contents
    builder.doc.at_css('body').children = page_doc

Upvotes: 1
