Reputation: 9549
I have a big CSV file containing contacts. All the non-ASCII characters are displayed like this:
ZÃƒÂ¼rich (Zürich)
GrÃƒÂ´ne (Grône)
ChesiÃƒÂ¨res (Chesières)
GenÃƒÂ¨ve (Genève)
I tried to replace them with the correct characters, like:
str_replace('ÃƒÂ¼', 'ü', $string);
They don't change. I also tried inserting them into a MySQL database and replacing them there, but they stay the same.
What should I do?
Upvotes: 0
Views: 5341
Reputation: 211690
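Picking this apart, let's look at the crux of the problem. The character ü is code point 252 (U+00FC), which UTF-8 encodes as the two bytes 195, 188. Read as Windows-1252, those two bytes display as Ã¼.
The key thing here is that when you see UTF-8 (multibyte) to Windows-1252 (single-byte) encoding errors, a single UTF-8 character usually ends up as two nonsense characters. Seeing four here suggests a double mangling: the bytes 195, 188 were read as Windows-1252 (Ã¼), re-encoded to UTF-8 as 195, 131, 194, 188, and read as Windows-1252 again (ÃƒÂ¼).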
So there it is. Somehow this was run through two layers of mangling. To undo it, encode the text as Windows-1252 and read the resulting bytes as UTF-8, then repeat that step once more.
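As a rough illustration of that undo step, here is a minimal Python sketch (Python only for brevity; `unmangle` and its `layers` parameter are hypothetical names, not part of any library). It assumes every layer really was UTF-8 bytes misread as Windows-1252. Python's `cp1252` codec rejects the few byte values that Windows-1252 leaves undefined, so some characters may raise an error instead of round-tripping.

```python
def unmangle(text: str, layers: int = 2) -> str:
    """Undo `layers` rounds of 'UTF-8 bytes misread as Windows-1252'.

    Hypothetical helper: assumes every layer used exactly these two encodings.
    Strict error handling is kept so a wrong assumption fails loudly.
    """
    for _ in range(layers):
        # Recover the bytes that were misread, then decode them correctly.
        text = text.encode("cp1252").decode("utf-8")
    return text


# "ZÃƒÂ¼rich", written with escapes so the source file stays pure ASCII.
mangled = "Z\u00c3\u0192\u00c2\u00bcrich"
print(unmangle(mangled))
```

If only one layer of mangling is present (two nonsense characters per accented letter instead of four), pass `layers=1`.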
Upvotes: 3
Reputation: 20812
Working from what @tadman described, and from the 132 encodings known to my system, there are several combinations that could have resulted in this mojibake.
encoding1 | encoding2 | encoding3 | encoding4
65001 utf-8 | 1252 iso-8859-1 | 65001 utf-8 | 1252 iso-8859-1
65001 utf-8 | 1252 iso-8859-1 | 65001 utf-8 | 1254 iso-8859-9
65001 utf-8 | 1254 iso-8859-9 | 65001 utf-8 | 1252 iso-8859-1
65001 utf-8 | 1254 iso-8859-9 | 65001 utf-8 | 1254 iso-8859-9
65001 utf-8 | 28591 iso-8859-1 | 65001 utf-8 | 1252 iso-8859-1
65001 utf-8 | 28591 iso-8859-1 | 65001 utf-8 | 1254 iso-8859-9
65001 utf-8 | 28599 iso-8859-9 | 65001 utf-8 | 1252 iso-8859-1
65001 utf-8 | 28599 iso-8859-9 | 65001 utf-8 | 1254 iso-8859-9
65001 utf-8 | 65000 utf-7 | 65001 utf-8 | 1252 iso-8859-1
65001 utf-8 | 65000 utf-7 | 65001 utf-8 | 1254 iso-8859-9
So, once you are confident of the exact encodings involved and you check that they are reversible, you can reverse the mojibake like this:
var latin1 = Encoding.GetEncoding(1252, EncoderExceptionFallback.ExceptionFallback, DecoderExceptionFallback.ExceptionFallback);
var utf8 = Encoding.GetEncoding(65001, EncoderExceptionFallback.ExceptionFallback, DecoderExceptionFallback.ExceptionFallback);
utf8.GetString(latin1.GetBytes(utf8.GetString(latin1.GetBytes("ZÃƒÂ¼rich")))).Dump();
The combinations above came from this search, in C# (LINQPad):
// Format an encoding as "<code page> <body name>" for display.
Func<Encoding, String> format = (encoding) => $"{encoding.CodePage} {encoding.BodyName}";
var encodings = Encoding.GetEncodings().Select(e => e.GetEncoding()).ToList();
(
    // Try every chain of four encodings: encode, decode, encode, decode.
    from encoding1 in encodings
    from encoding2 in encodings
    from encoding3 in encodings
    from encoding4 in encodings
    // Keep only the chains that turn each clean sample into the mangled text seen in the file.
    where encoding4.GetString(encoding3.GetBytes(encoding2.GetString(encoding1.GetBytes("ü")))) == "ÃƒÂ¼"
    where encoding4.GetString(encoding3.GetBytes(encoding2.GetString(encoding1.GetBytes("ô")))) == "ÃƒÂ´"
    where encoding4.GetString(encoding3.GetBytes(encoding2.GetString(encoding1.GetBytes("è")))) == "ÃƒÂ¨"
    select new { encoding1 = format(encoding1), encoding2 = format(encoding2), encoding3 = format(encoding3), encoding4 = format(encoding4) }
).Dump();
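The same kind of brute-force search can be sketched in Python. This is an illustration, not a port of the query above: `CANDIDATES` is a short, hand-picked list (an assumption) rather than every encoding on the system, and `mangle` and `find_chains` are hypothetical helper names. Python's codec names and tables differ from .NET's, so the results need not match the table exactly.

```python
from itertools import product

# Assumed shortlist of codecs to try; extend it as needed.
CANDIDATES = ["utf-8", "cp1252", "cp1254", "latin-1", "iso8859_9"]

# Clean character -> mangled form seen in the file (escapes keep this file ASCII).
SAMPLES = {
    "\u00fc": "\u00c3\u0192\u00c2\u00bc",  # ü -> ÃƒÂ¼
    "\u00f4": "\u00c3\u0192\u00c2\u00b4",  # ô -> ÃƒÂ´
    "\u00e8": "\u00c3\u0192\u00c2\u00a8",  # è -> ÃƒÂ¨
}


def mangle(text, enc1, dec1, enc2, dec2):
    """Apply encode/decode/encode/decode; return None if any step fails."""
    try:
        return text.encode(enc1).decode(dec1).encode(enc2).decode(dec2)
    except UnicodeError:
        return None


def find_chains(samples=SAMPLES, candidates=CANDIDATES):
    """Return every 4-codec chain that reproduces all of the mangled samples."""
    return [
        chain
        for chain in product(candidates, repeat=4)
        if all(mangle(clean, *chain) == bad for clean, bad in samples.items())
    ]


for chain in find_chains():
    print(" | ".join(chain))
```

Any chain it reports is only a candidate. As noted above, check that the chain is reversible on your real data before relying on it.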
Upvotes: 1
Reputation: 2695
Please check the encoding of the file. Once you know it, you can read it in the proper way.
After that, you can convert the encoding, e.g., to UTF-8.
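One rough way to check the encoding is to try strict decoding with a few likely candidates and keep the first that succeeds. This Python sketch assumes the candidates are UTF-8 and Windows-1252; `decode_with_first_match` and `to_utf8` are hypothetical names. It is only a heuristic: a single-byte encoding such as Windows-1252 accepts almost any input, so list the stricter encoding first.

```python
def decode_with_first_match(data: bytes, candidates=("utf-8", "cp1252")):
    """Return (encoding_name, text) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate does not fit; try the next one
    raise ValueError("none of the candidate encodings fit the data")


def to_utf8(data: bytes) -> bytes:
    """Re-encode the data as UTF-8, whichever candidate it was read with."""
    _, text = decode_with_first_match(data)
    return text.encode("utf-8")


# Example: "Genève" stored as Windows-1252 bytes.
raw = "Gen\u00e8ve".encode("cp1252")
name, text = decode_with_first_match(raw)
print(name, to_utf8(raw))
```

For a real file, read it in binary mode (`open(path, "rb")`) and pass the bytes to these helpers.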
Upvotes: 4