Reputation:
I have Unicode strings stored in a database. Some of the character encodings are wrong and instead of displaying actual characters for the language, it's now displaying characters that make no sense. How do I fix this issue? Is there a way to detect if strings have a wrong encoding?
Upvotes: 1
Views: 1962
Reputation: 113222
The problem with mojibake (a Japanese term that gets used in English because Japan, a non-Western country with heavy early computer use, ran into the issue a lot) is that the garbled characters are generally valid in themselves but make no sense as text, which makes them much harder to detect with 100% accuracy.
The first thing you need to do is identify the encoding the data was really in and the encoding it was read as, and then write a converter to undo that.
For example, if UTF-8 had been misinterpreted as ISO 8859-1, you would read through the text, encode it back into ISO 8859-1 to recover the original bytes, and then decode those bytes as UTF-8, as should have been done in the first place.
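The round trip described above can be sketched as follows. This is a minimal illustration, assuming the damage really was UTF-8 read as ISO 8859-1; the class and method names (`MojibakeRepair`, `RepairUtf8ReadAsLatin1`) are hypothetical, and the string literals use `\u` escapes so the example does not depend on the source file's own encoding.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Hypothetical helper: undo UTF-8 text that was wrongly decoded as ISO 8859-1.
    public static string RepairUtf8ReadAsLatin1(string garbled)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // Step 1: re-encode with the encoding that was wrongly used,
        // which recovers the original bytes.
        byte[] originalBytes = latin1.GetBytes(garbled);

        // Step 2: decode those bytes the way they should have been decoded.
        return Encoding.UTF8.GetString(originalBytes);
    }

    public static void Main()
    {
        string garbled = "\u00C3\u00A7";            // "Ã§", the mangled form of "ç"
        string repaired = RepairUtf8ReadAsLatin1(garbled);
        Console.WriteLine(repaired == "\u00E7");    // compares against "ç"
    }
}
```

Only run this on strings you already believe are damaged: by default `GetBytes` replaces any character that ISO 8859-1 cannot represent with `?`, so applying it to correct text that contains such characters would destroy them.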
Now for the hard part: finding the incorrect strings. If you can do this by some means that isn't heuristic, then that is the way to go (e.g. if you know that every record added within a particular range of id numbers is invalid, just use that).
Failing that, your best bet is to do some heuristics as follows:
Note that we can compute such sequences if we have System.Text.Encoding objects that correspond to the mojibake. If, for example, you had read the data as your system's default encoding when you should have read it as UTF-8, then you could use:
Encoding.Default.GetString(Encoding.UTF8.GetBytes(testString))
For example:
Encoding.Default.GetString(Encoding.UTF8.GetBytes("ç"))
returns "Ã§" (on a system whose default encoding is Windows-1252).
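Building on that, here is a rough sketch of a detector that computes the garbled form of a few characters you expect in your data and then searches strings for those sequences. The candidate characters, the choice of ISO 8859-1 as the wrong encoding, and the names `MojibakeDetector`, `BuildSignatures` and `LooksGarbled` are all assumptions for illustration. It uses an explicit encoding rather than `Encoding.Default` because on .NET Core and .NET 5+ `Encoding.Default` is always UTF-8, which would make the round trip a no-op.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MojibakeDetector
{
    // For each expected character, compute what it looks like after its
    // UTF-8 bytes have been decoded with the wrong encoding.
    public static List<string> BuildSignatures(IEnumerable<string> candidates, Encoding wrongEncoding)
    {
        var signatures = new List<string>();
        foreach (string candidate in candidates)
        {
            signatures.Add(wrongEncoding.GetString(Encoding.UTF8.GetBytes(candidate)));
        }
        return signatures;
    }

    // Heuristic only: true if the text contains any known garbled sequence.
    public static bool LooksGarbled(string text, IEnumerable<string> signatures)
    {
        foreach (string signature in signatures)
        {
            if (text.IndexOf(signature, StringComparison.Ordinal) >= 0)
            {
                return true;
            }
        }
        return false;
    }

    public static void Main()
    {
        // Assumed wrong encoding and assumed candidate characters (ç, é, ü).
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        var candidates = new[] { "\u00E7", "\u00E9", "\u00FC" };
        List<string> signatures = BuildSignatures(candidates, latin1);

        Console.WriteLine(LooksGarbled("fran\u00C3\u00A7ais", signatures)); // mangled "français"
        Console.WriteLine(LooksGarbled("fran\u00E7ais", signatures));       // correct "français"
    }
}
```

Since this is a heuristic, it can flag legitimate text that happens to contain one of the sequences, so it is worth reviewing a sample of the matches before converting anything in bulk.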
Upvotes: 3