Ruby character encoding issue with scraped HTML

Question

I'm having a character encoding issue with a Ruby script that does some HTML scraping and parsing with the Nokogiri gem. At one point in the script, I call join(" ") on an array of strings that have been pulled from some HTML, which causes this error:

./script.rb:333:in `join': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)

In my logs, I can see CafÃ© showing up for some of the strings that would be included in the join operation.

Is it that some of the strings in my array to be joined are ASCII-8BIT and some are UTF-8 and ruby can't combine them? Do I need to convert or sanitize my strings after parsing them with Nokogiri (into UTF-8)?.

I tried force_encoding('UTF-8') and encode('UTF-8') on the scraped HTML content before I do anything else with it, but it didn't help. In fact, after I tried encode('UTF-8'), my script crashed even earlier when it called to_s on a string containing CafÃ©.

Character encoding always really confuses me. Is there something else I can do to sanitize the strings to avoid this error?

Edit:
I was doing something similar in Perl recently and used a module called Text::Unidecode and was able to pass my strings to a function that translates any problematic characters e.g. the letter a with an acute to the plain letter a. Is there anything similar for ruby? (This isn't necessarily what I'm aiming for though, if I can keep the a with acute then that's preferable I think.

Edit2:
~~I'm really confused by this and it's proving difficult to reproduce reliably. Here's some code:~~

[CODE REMOVED]

Edit3:
I removed the previously posted code example because it wasn't correct. But the bottom line is, whenever I try to print or call to_s on the string that was scraped, I get the encoding error.

Edit4:
It turned out in the end that the scraped html input was not what was causing the problem. I got the encoding error whenever I tried to print or call to_s on a hash containing, among other things, the scraped html text. The 'other things' were values from database queries, and they were being returned in ASCII-8BIT. To fix the issue, I explicitly had to call force_encoding('UTF-8') on each database value that I use (although I hear that the mysql2 gem does this automatically so I should switch to that).

I hate character encoding.

mu is too short · Accepted Answer

Presumably, CafÃ© is supposed to be Café. If we start out with Café in UTF-8 but treat the bytes as though they were encoded in ISO-8859-1 (AKA Latin-1) and then re-encode them as UTF-8, we get the CafÃ© that you're seeing; for example:

> s = 'Café'
 => "Café" 
> s.encoding
 => # 
> s.force_encoding('iso-8859-1').encode('utf-8')
 => "CafÃ©"

So somewhere you're reading a UTF-8 string but treating it as Latin-1 and re-encoding it as UTF-8. I'd guess that Nokogiri is reading the page and thinking that it is Latin-1 or being told by your user agent that it is getting Latin-1 text. Perhaps you have a bad default encoding somewhere, or the HTTP headers are lying about the encoding, or the page itself is lying about its encoding.

You need to get everything into UTF-8 at the edges of your scraper. Figure out who is lying about the encoding and sort it out right there.

Don't feel bad, scraping and encoding is a nightmare of confusion, stupidity, guesswork, and hard liquor. Servers lie, pages lie, browsers lie, no one is happy.

Ruby character encoding issue with scraped HTML

Answers (1)

Related Questions