Reputation: 307
We are using Nokogiri to create HTML5 pages based on user inputs and we are seeing some strange encoding issues.
In our database table we have an attribute called compiled_html which contains:
<p class="lead align-left">Just testing out some encoding issues:<br><br>Héllo Äre Thésè symbols showing correctly? </p>
After pulling this HTML snippet from our DB and creating a new page, Nokogiri outputs:
<p class="lead align-left">Just testing out some encoding issues:<br><br>Héllo Ãre Thésè symbols showing correctly? </p>
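A stray "Ã" where an "Ä" should be is consistent with UTF-8 bytes being decoded as ISO-8859-1 somewhere along the way, although that is only one plausible reading of the output above. Here is a minimal sketch of that effect in plain Ruby, without Nokogiri. The helper name `misread_as_latin1` is made up for illustration and is not part of any library.

```ruby
# Hypothetical helper (not part of Nokogiri or the original post):
# decode the UTF-8 bytes of a string as if they were ISO-8859-1.
def misread_as_latin1(utf8_text)
  relabelled = utf8_text.dup.force_encoding(Encoding::ISO_8859_1) # same bytes, wrong label
  relabelled.encode(Encoding::UTF_8)                              # convert the misread characters
end

# "Äre", written with an escape so the source file encoding does not matter.
original = "\u00C4re"
garbled  = misread_as_latin1(original)

# Print code points rather than the raw string, because one of the
# resulting characters is an invisible control character.
puts garbled.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")

# The damage is reversible as long as no bytes were lost in between.
repaired = garbled.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
puts repaired == original
```

The two bytes of "Ä" (C3 84) come back as two separate characters: "Ã" (U+00C3) plus a control character (U+0084) that most browsers and terminals do not display, which would explain why only "Ã" is visible.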
This is how we are pulling the compiled_html data and using Nokogiri:
page_doc = Nokogiri::HTML::fragment(page.compiled_html)
(Side note: when we output page_doc right after parsing, the HTML is correct.)
# create page html
builder = Nokogiri::HTML::Builder.new(:encoding => 'UTF-8') do |doc|
  doc.html {
    doc.head {
      doc.title page.name
      doc.meta(charset: 'utf-8')
      doc.meta(name: 'viewport', content: 'width=device-width, initial-scale=1.0')
      doc.meta(name: 'description', content: '')
      doc.meta(name: 'author', content: "#{issue.publication.user.firstname} #{issue.publication.user.lastname}")
      doc.link(rel: 'stylesheet', href: "themes/#{theme.identifier}/theme.css")
      doc.script(type: 'text/javascript', src: "themes/#{theme.identifier}/theme.js")
    }
    doc.body {
      doc << page_doc
    }
  }
end
We have tried setting the encoding to utf-8 in different ways, but no matter what we try we still get the weird symbols in our output.
This is for a Ruby on Rails 4 app.
Any ideas? Thanks!
Update: If I change:
doc.body {
  doc << page_doc
}
To this:
doc.body {
  doc.text page_doc
}
Then the character encoding is correct, but the HTML is escaped: I get &lt; instead of <, and so on.
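That escaping is what you would expect whenever markup is inserted as plain text: the special characters are turned into entities so the browser shows them literally, while non-ASCII letters are left alone. A small sketch of the same kind of escaping, using `ERB::Util.html_escape` from Ruby's standard library as a stand-in. This is an assumption for illustration only, not the code path Nokogiri uses internally.

```ruby
require "erb"

# A fragment with markup and an accented character
# ("\u00E9" is "é", escaped so the source file encoding does not matter).
fragment = "<p>H\u00E9llo<br></p>"

# Escaping as text replaces <, >, &, and quotes with entities
# and leaves the accented letter untouched.
escaped = ERB::Util.html_escape(fragment)

puts escaped
```

So the accented characters survive, but every tag is shown as literal text instead of being rendered.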
Upvotes: 2
Views: 641
Reputation: 307
Okay, so we figured out that the problem is actually related to Heroku and the version of libxml installed there. See the related question "Nokogiri adds characters during parsing on Heroku".
My brilliant programmer came up with a quick fix that solved our problem:
doc.body {
  # doc << page_doc
}
# insert html contents
builder.doc.at_css('body').children = page_doc
Upvotes: 1