RTF
RTF

Reputation: 6524

Ruby character encoding issue with scraped HTML

I'm having a character encoding issue with a Ruby script that does some HTML scraping and parsing with the Nokogiri gem. At one point in the script, I call join("\n") on an array of strings that have been pulled from some HTML, which causes this error:

./script.rb:333:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)

In my logs, I can see Café showing up for some of the strings that would be included in the join operation.

Is it that some of the strings in my array to be joined are ASCII-8BIT and some are UTF-8 and ruby can't combine them? Do I need to convert or sanitize my strings after parsing them with Nokogiri (into UTF-8)?.

I tried force_encoding('UTF-8') and encode('UTF-8') on the scraped HTML content before I do anything else with it, but it didn't help. In fact, after I tried encode('UTF-8'), my script crashed even earlier when it called to_s on a string containing Café.

Character encoding always really confuses me. Is there something else I can do to sanitize the strings to avoid this error?

Edit:
I was doing something similar in Perl recently and used a module called Text::Unidecode and was able to pass my strings to a function that translates any problematic characters e.g. the letter a with an acute to the plain letter a. Is there anything similar for ruby? (This isn't necessarily what I'm aiming for though, if I can keep the a with acute then that's preferable I think.

Edit2:
I'm really confused by this and it's proving difficult to reproduce reliably. Here's some code:

[CODE REMOVED]

Edit3:
I removed the previously posted code example because it wasn't correct. But the bottom line is, whenever I try to print or call to_s on the string that was scraped, I get the encoding error.

Edit4:
It turned out in the end that the scraped html input was not what was causing the problem. I got the encoding error whenever I tried to print or call to_s on a hash containing, among other things, the scraped html text. The 'other things' were values from database queries, and they were being returned in ASCII-8BIT. To fix the issue, I explicitly had to call force_encoding('UTF-8') on each database value that I use (although I hear that the mysql2 gem does this automatically so I should switch to that).

I hate character encoding.

Upvotes: 0

Views: 489

Answers (1)

mu is too short
mu is too short

Reputation: 434955

Presumably, Café is supposed to be Café. If we start out with Café in UTF-8 but treat the bytes as though they were encoded in ISO-8859-1 (AKA Latin-1) and then re-encode them as UTF-8, we get the Café that you're seeing; for example:

> s = 'Café'
 => "Café" 
> s.encoding
 => #<Encoding:UTF-8> 
> s.force_encoding('iso-8859-1').encode('utf-8')
 => "Café" 

So somewhere you're reading a UTF-8 string but treating it as Latin-1 and re-encoding it as UTF-8. I'd guess that Nokogiri is reading the page and thinking that it is Latin-1 or being told by your user agent that it is getting Latin-1 text. Perhaps you have a bad default encoding somewhere, or the HTTP headers are lying about the encoding, or the page itself is lying about its encoding.

You need to get everything into UTF-8 at the edges of your scraper. Figure out who is lying about the encoding and sort it out right there.

Don't feel bad, scraping and encoding is a nightmare of confusion, stupidity, guesswork, and hard liquor. Servers lie, pages lie, browsers lie, no one is happy.

Upvotes: 2

Related Questions