Alex Peattie
Alex Peattie

Reputation: 27697

Strange character encoding issue

I have some data which has been imported into Postgres, for use in a Rails application. However somehow the foreign accents have become strangely encoded:

I'm pretty sure the problem is with the integrity of the data, rather than any problem with Rails. It doesn't seem to match any encoding I try:

# Replace "cp1252" with any other encoding, to no effect
"Trollâ§ttan".encode("cp1252").force_encoding("UTF-8") #-> junk

If anyone was able to identify what kind of encoding mixup I'm suffering from, that would be great.

As a last resort, I may have to manually replace each corrupted accent character, but if anyone can suggest a programatic solution (or a even a starting point for fixing this - I've found it very hard to debug), I'd be v. grateful.

Upvotes: 1

Views: 490

Answers (1)

Daniel Vérité
Daniel Vérité

Reputation: 61716

It's hardly possible with recent versions of PostgreSQL to have invalid UTF8 inside a UTF8 database. There are other plausible possibilities that may lead to that output, though.

In the typical case of é appearing as ©, either:

  1. The contents of the database are valid, but some client-side layer is interpreting the bytes from the database as if they were iso-latin-something whereas they are UTF8.

  2. The contents are valid and the SQL client-side layer is valid, but the terminal/software/webpage with which you're looking at this is configured for iso-latin1 or a similar mono-bytes encoding (win1252, iso-latin9...).

  3. The contents of the database consist of the wrong characters with a valid UTF8 encoding. This is what you end up with if you take iso-latin-something bytes, convert them to UTF8 representation, then take the resulting byte stream as if if was still in iso-latin, and reconvert it once again to UTF8, and insert that into the database.

Note that while the © sequence is typical in UTF8 versus iso-latin confusion, the presence of an additional â in all your sample strings is uncommon. It may be the result of another misinterpretation on top of the primary one. If you're in case #3, that may mean that an automated fix based on search-replace will be harder than the normal case which is already tricky.

Upvotes: 2

Related Questions