Reputation: 13
I am getting ÐиÑилл ÐаÑанник
from a C++ component and I need to decode it. The string is always UTF-8 encoded. After much RnD, I figured following way to decode it.
String text = Encoding.UTF8
.GetString(Encoding.GetEncoding("iso-8859-1")
.GetBytes("ÐиÑилл ÐаÑанник"));
But isn't this hardcoding "iso-8859-1"
, as in what if characters other than cyrillic come up. So I want to have a generic method for decoding a UTF-8 string.
Thanks in advance.
Upvotes: 1
Views: 3836
Reputation: 50336
When you type text, the computer sees only bytes. In this case, when you type Cyrillic characters into your C++ program, the computer converts each character to its corresponding UTF-8 encoded character.
string typedByUser = "Привет мир!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
Then your C++ program comes along, looks at the bytes and thinks it is ISO-8859-1 encoded.
string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ÐÑÐ¸Ð²ÐµÑ Ð¼Ð¸Ñ!
Not much you can do about that. Then you get the wrongly encoded string and have to assume it is incorrectly ISO-8859-1 encoded UTF-8. This assumption proves to be correct, but you cannot determine this in any way.
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// Привет мир!
Note that ISO-8859-1 is the ISO West-European encoding, and has nothing to do with the fact that the original input was Cyrillic. For example, if the input was Japanese UTF-8 encoded, your C++ program would still interpret it as ISO-8859-1:
string typedByUser = "こんにちは、世界!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ããã«ã¡ã¯ãä¸çï¼
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// こんにちは、世界!
The C++ program will always interpret the input as ISO-8859-1, regardless of whether it is Cyrillic, Japanese or plain English. So that assumption is always correct.
However, you have an additional assumption that the original input is UTF-8 encoded. I'm not sure whether that is always correct. It may depend on the program, the input mechanism it uses and the default encoding used by the Operating System. For example, the C++ program made the assumption that the original input is ISO-8859-1 encoded, which was wrong.
By the way, character encodings have always been problematic. A great example is a letter from a French student to his Russian friend where the Cyrillic address was incorrectly written as ISO-8859-1 on the envelope, and decoded by the postal employees.
Upvotes: 3
Reputation: 2097
A source of characters should only be transfered in one encoding, that means it's either iso-8859-1 or something else, but not both at the same time (that means you might be wrong about the reverse engineered cyrillic in the first place)
Could you post the expected UTF-8 output of your input?
Upvotes: 0