Reputation: 1875
I'm having a very strange problem and I'd appreciate help tracking it down.
I'm using the Nokogiri gem to parse some HTML, and the file I'm parsing has a weird character in it. I'm not entirely sure what this character is; in vim it shows as ^Q.
On my own computer everything works fine, but on Heroku it inserts a </body></html><html> when it hits the character, and selectors only return the elements before the weird character.
To illustrate:
Nokogiri::HTML( open("http://thoms.net.nz/e2.html")).css("body div").count
returns 1 on Heroku and 2 on my computer. The file containing this character can be downloaded from http://thoms.net.nz/e2.html.
Both my computer and Heroku are running Nokogiri 1.5.5 with Ruby 1.9.3.
Upvotes: 2
Views: 267
Reputation: 160549
The ^Q is the DC1 control character (U+0011), used as XON in software flow control, and it isn't supposed to be in HTML. I suspect its unexpected presence is confusing both Nokogiri and Heroku, but in different ways.
HTML documents from the wilds of the internet can be corrupted in any number of ways. I've seen all sorts of garbage in them, and if I couldn't make sense of it using iconv or a Unicode transliteration, I'd resort to a quick global search and replace to remove anything not in the normal ASCII range before further processing.
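If you aren't sure which character is causing trouble, a quick scan can tell you. This is only a rough sketch: find_control_chars is a hypothetical helper name, not part of Ruby or Nokogiri, and it assumes the input is a validly encoded string. It reports the character offset and code point of each control character other than tab, line feed, and carriage return.

```ruby
# Hypothetical helper: lists each control character in a string
# as [character offset, code point in hex].
# Tab (0x09), line feed (0x0A), and carriage return (0x0D) are ignored.
def find_control_chars(text)
  found = []
  text.each_char.with_index do |ch, i|
    found << [i, format('0x%02X', ch.ord)] if ch =~ /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/
  end
  found
end

sample = "<div>one</div>\u0011<div>two</div>"
p find_control_chars(sample)
```

A result containing 0x11 would confirm that the ^Q shown by vim is the DC1/XON character.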
In Ruby, global search and replace uses String#gsub.
doc = Nokogiri::HTML(html.gsub("\u0011", ''))
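If other stray control characters might turn up, the same gsub idea can be widened. The following is a minimal sketch, not a definitive fix: strip_control_chars and CONTROL_CHARS are hypothetical names, and it assumes the HTML is already a validly encoded string. It removes the control characters and DEL while keeping tab, line feed, and carriage return, which are legitimate in HTML.

```ruby
# Hypothetical pattern: control characters 0x00-0x1F plus DEL (0x7F),
# excluding tab (0x09), line feed (0x0A), and carriage return (0x0D).
CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/

# Hypothetical helper: returns a copy of the string with those characters removed.
def strip_control_chars(html)
  html.gsub(CONTROL_CHARS, '')
end

dirty = "<div>one</div>\u0011<div>two</div>\n"
clean = strip_control_chars(dirty)
puts clean
```

The cleaned string can then be passed to Nokogiri::HTML in place of the single-character gsub above.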
Upvotes: 2