Jords

Reputation: 1875

Nokogiri producing different results on heroku?

I'm having a very strange problem and I'd appreciate help tracking it down.

I'm using the nokogiri gem to parse some HTML, and the file I'm parsing contains a strange character. I'm not entirely sure what the character is; in vim it shows as ^Q.

On my own computer everything works fine. On Heroku, however, the parser inserts a </body></html><html> when it hits the character, and selectors return only the elements before it.

To illustrate: Nokogiri::HTML(open("http://thoms.net.nz/e2.html")).css("body div").count is 1 on Heroku and 2 on my computer. The file containing this character can be downloaded from http://thoms.net.nz/e2.html.

Both my computer and Heroku are running nokogiri 1.5.5 with Ruby 1.9.3.

Upvotes: 2

Views: 267

Answers (1)

the Tin Man

Reputation: 160549

The ^Q is a control character (DC1, code point 0x11, used as XON in software flow control), which isn't supposed to be in HTML. I suspect its unexpected presence is confusing both Nokogiri and Heroku, but in different ways.
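To confirm which character is actually in the file, a small check along these lines lists any control characters with their offsets and code points. This is only a sketch: the helper name `find_control_chars` and the sample string are made up for illustration, and it assumes the input is a valid UTF-8 string.

```ruby
# Hypothetical helper (not part of Nokogiri): report each C0 control
# character other than tab, LF, and CR as [offset, "U+XXXX"].
def find_control_chars(text)
  text.each_char.with_index
      .select { |ch, _| ch =~ /[\x00-\x08\x0B\x0C\x0E-\x1F]/ }
      .map { |ch, i| [i, format('U+%04X', ch.ord)] }
end

# Made-up input with a ^Q (U+0011) between two divs.
sample = "<div>one</div>\u0011<div>two</div>"
hits = find_control_chars(sample)
p hits
```

Running the same helper over the contents of the real file should show whether ^Q is the only offender or whether other stray bytes are present too.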

HTML documents from the wilds of the internet can be corrupted in any number of ways. I've seen all sorts of garbage in them, and if I couldn't make sense of it using iconv or a Unicode transliteration, I'd resort to a quick global search and replace to remove anything not in the normal ASCII range before further processing.


In Ruby, global search and replace uses String#gsub.

doc = Nokogiri::HTML(html.gsub("\u0011", ''))
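The line above removes only U+0011. If other control characters might turn up, a somewhat broader filter could look like the sketch below. The names `CONTROL_CHARS` and `strip_control_chars` are hypothetical, the sample input is made up, and the choice to keep tab, line feed, and carriage return is an assumption about what counts as "normal" here.

```ruby
# Hypothetical helper (not part of Nokogiri): drop C0 control characters
# and DEL, but keep tab (\x09), line feed (\x0A), and carriage return (\x0D).
CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/

def strip_control_chars(html)
  html.gsub(CONTROL_CHARS, '')
end

# Made-up input: a ^Q between two divs, plus ordinary whitespace.
dirty = "<div>one</div>\u0011\n\t<div>two</div>\n"
clean = strip_control_chars(dirty)
puts clean
```

The cleaned string can then be passed to the parser in place of the raw one, for example Nokogiri::HTML(strip_control_chars(html)).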

Upvotes: 2
