medk

Reputation: 9549

German Character Encoding

I have a big CSV file containing contacts. All the non-Latin characters are displayed like this:

ZÃƒÂ¼rich (Zürich)

GrÃƒÂ´ne (Grône)

ChesiÃƒÂ¨res (Chesières)

GenÃƒÂ¨ve (Genève)

I tried to replace them with their right characters, like:

str_replace('ÃƒÂ¼', 'ü', $string);

They don't change. I also tried inserting them into a MySQL database and replacing them there, but they still stay the same.

What should I do?

Upvotes: 0

Views: 5341

Answers (3)

tadman

Reputation: 211690

Picking this apart, let's look at the crux of the problem.

  • ü in UTF-8: 195, 188
  • ü in Windows-1252: 252
  • ü in UTF-8 misinterpreted as Windows-1252: Ã¼ (195, 188)
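
The byte values above can be checked with a few lines of Python. This is only a sketch using the standard-library codecs, where `cp1252` is Python's name for Windows-1252; the variable names are made up for illustration.

```python
# Encode the same character under both encodings and compare the bytes.
utf8_bytes = "ü".encode("utf-8")
cp1252_bytes = "ü".encode("cp1252")

print(list(utf8_bytes))    # [195, 188]
print(list(cp1252_bytes))  # [252]

# Reading the two UTF-8 bytes as Windows-1252 yields two separate characters.
misread = utf8_bytes.decode("cp1252")
print(len(misread))                    # 2
print([hex(ord(c)) for c in misread])  # ['0xc3', '0xbc']
```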

The key point is that when UTF-8 (multibyte) text is misread as Windows-1252 (single byte), a single UTF-8 character usually ends up as two nonsense characters. Seeing four here suggests a double mangling:

  • ü in UTF-8 misinterpreted as Windows-1252: Ã¼
  • Ã¼ in UTF-8 misinterpreted as Windows-1252: ÃƒÂ¼

So there it is. Somehow this was run through two layers of mangling. To undo it, reverse one layer at a time: encode the text as Windows-1252 to get the raw bytes back, decode those bytes as UTF-8, then do exactly the same thing a second time.
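
As a rough illustration of that round trip, here is a Python sketch. It assumes both layers really were UTF-8 misread as Windows-1252; the helper names `mangle_once` and `unmangle_once` are hypothetical, not from any library.

```python
def mangle_once(text: str) -> str:
    """Simulate one layer of damage: UTF-8 bytes misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def unmangle_once(text: str) -> str:
    """Undo one layer: get the bytes back via Windows-1252, read them as UTF-8."""
    return text.encode("cp1252").decode("utf-8")


original = "Zürich"
damaged = mangle_once(mangle_once(original))       # two layers of mangling
repaired = unmangle_once(unmangle_once(damaged))   # two layers of repair

print(len(original), len(damaged))  # 6 9
print(repaired == original)         # True
```

The same two-step repair should apply to the other place names in the question, since ô and è go through the same byte patterns, but it is worth testing on a copy of the data first.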

Upvotes: 3

Tom Blodget

Reputation: 20812

Working from what @tadman described, and from the 132 encodings known to my system, there are several combinations that could have resulted in this mojibake.

encoding1   | encoding2          | encoding3    | encoding4
65001 utf-8 | 1252 iso-8859-1    | 65001 utf-8  | 1252 iso-8859-1
65001 utf-8 | 1252 iso-8859-1    | 65001 utf-8  | 1254 iso-8859-9
65001 utf-8 | 1254 iso-8859-9    | 65001 utf-8  | 1252 iso-8859-1
65001 utf-8 | 1254 iso-8859-9    | 65001 utf-8  | 1254 iso-8859-9
65001 utf-8 | 28591 iso-8859-1   | 65001 utf-8  | 1252 iso-8859-1
65001 utf-8 | 28591 iso-8859-1   | 65001 utf-8  | 1254 iso-8859-9
65001 utf-8 | 28599 iso-8859-9   | 65001 utf-8  | 1252 iso-8859-1
65001 utf-8 | 28599 iso-8859-9   | 65001 utf-8  | 1254 iso-8859-9
65001 utf-8 | 65000 utf-7        | 65001 utf-8  | 1252 iso-8859-1
65001 utf-8 | 65000 utf-7        | 65001 utf-8  | 1254 iso-8859-9

So, once you are confident of the exact encodings involved and you check that they are reversible, you can reverse the mojibake like this:

var latin1 = Encoding.GetEncoding(1252, EncoderExceptionFallback.ExceptionFallback, DecoderExceptionFallback.ExceptionFallback);
var utf8 = Encoding.GetEncoding(65001, EncoderExceptionFallback.ExceptionFallback, DecoderExceptionFallback.ExceptionFallback);
utf8.GetString(latin1.GetBytes(utf8.GetString(latin1.GetBytes("ZÃƒÂ¼rich")))).Dump();
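
For comparison, a similar "fail loudly instead of guessing" check might look like this in Python. It is a sketch that assumes Windows-1252 and UTF-8 are the encodings involved; `try_undo` is a hypothetical helper name. Strict error handling plays the role of the exception fallbacks in the C# above: any step that cannot be reversed cleanly makes the function give up rather than return altered text.

```python
from typing import Optional


def try_undo(text: str, layers: int = 2) -> Optional[str]:
    """Return the repaired text, or None if any layer is not cleanly reversible."""
    for _ in range(layers):
        try:
            # Strict mode raises instead of substituting replacement characters.
            raw = text.encode("cp1252", errors="strict")
            text = raw.decode("utf-8", errors="strict")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return None
    return text


print(try_undo("Zürich") is None)   # True: already correct, so it is left alone
print(try_undo("plain ASCII"))      # plain ASCII
```

Returning `None` for text that is already correct is a deliberate choice here, so that rows which were never damaged are not "repaired" into something worse.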

C# (LINQPad)

Func<Encoding, String> format = (encoding) => $"{encoding.CodePage} {encoding.BodyName}";
var encodings = Encoding.GetEncodings().Select(e => e.GetEncoding()).ToList();
(
    from encoding1 in encodings
    from encoding2 in encodings
    from encoding3 in encodings
    from encoding4 in encodings
    where encoding4.GetString(encoding3.GetBytes(encoding2.GetString(encoding1.GetBytes("ü")))) == "ÃƒÂ¼"
    where encoding4.GetString(encoding3.GetBytes(encoding2.GetString(encoding1.GetBytes("ô")))) == "ÃƒÂ´"
    where encoding4.GetString(encoding3.GetBytes(encoding2.GetString(encoding1.GetBytes("è")))) == "ÃƒÂ¨"
    select new { encoding1 = format(encoding1), encoding2 = format(encoding2), encoding3 = format(encoding3), encoding4 = format(encoding4) }
).Dump();

Upvotes: 1

Nils

Reputation: 2695

Please check the encoding of the file. Once you know it, you can read it in the proper way.

After that, you can convert the encoding, e.g., to UTF-8.
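
One possible way to do that, sketched in Python: try a short list of candidate encodings in order and keep the first one that decodes without errors, then write the text back out as UTF-8. The candidate list, the function names, and the file paths are all assumptions for illustration. Note that Windows-1252 accepts almost any byte sequence, so it only makes sense as the last fallback.

```python
from pathlib import Path
from typing import Sequence, Tuple


def decode_bytes(raw: bytes, candidates: Sequence[str] = ("utf-8", "cp1252")) -> Tuple[str, str]:
    """Return (text, encoding_name) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")


def convert_to_utf8(src: str, dst: str) -> str:
    """Read src in whatever candidate encoding fits and write dst as UTF-8."""
    text, detected = decode_bytes(Path(src).read_bytes())
    # Write bytes directly so line endings are preserved as they were.
    Path(dst).write_bytes(text.encode("utf-8"))
    return detected


# Hypothetical usage:
# print(convert_to_utf8("contacts.csv", "contacts-utf8.csv"))
```

A file can decode cleanly as UTF-8 and still contain mojibake that was saved into it earlier, so it is worth looking at a few decoded rows rather than trusting the detection alone.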

Upvotes: 4
