Generic solution needed for decoding Cyrillic string encoded in UTF-8 in C#

Question

I am getting ÐÐ¸ÑÐ¸Ð»Ð» ÐÐ°ÑÐ°Ð½Ð½Ð¸Ðº from a C++ component and I need to decode it. The string is always UTF-8 encoded. After much RnD, I figured following way to decode it.

String text = Encoding.UTF8
                      .GetString(Encoding.GetEncoding("iso-8859-1")
                      .GetBytes("ÐÐ¸ÑÐ¸Ð»Ð» ÐÐ°ÑÐ°Ð½Ð½Ð¸Ðº"));

But isn't this hardcoding "iso-8859-1", as in what if characters other than cyrillic come up. So I want to have a generic method for decoding a UTF-8 string.

Thanks in advance.

Daniel A.A. Pelsmaeker · Accepted Answer

When you type text, the computer sees only bytes. In this case, when you type Cyrillic characters into your C++ program, the computer converts each character to its corresponding UTF-8 encoded character.

string typedByUser = "Привет мир!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);

Then your C++ program comes along, looks at the bytes and thinks it is ISO-8859-1 encoded.

string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ÐÑÐ¸Ð²ÐµÑ Ð¼Ð¸Ñ!

Not much you can do about that. Then you get the wrongly encoded string and have to assume it is incorrectly ISO-8859-1 encoded UTF-8. This assumption proves to be correct, but you cannot determine this in any way.

byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// Привет мир!

Note that ISO-8859-1 is the ISO West-European encoding, and has nothing to do with the fact that the original input was Cyrillic. For example, if the input was Japanese UTF-8 encoded, your C++ program would still interpret it as ISO-8859-1:

string typedByUser = "こんにちは、世界！";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ããã«ã¡ã¯ãä¸çï¼
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// こんにちは、世界！

The C++ program will always interpret the input as ISO-8859-1, regardless of whether it is Cyrillic, Japanese or plain English. So that assumption is always correct.

However, you have an additional assumption that the original input is UTF-8 encoded. I'm not sure whether that is always correct. It may depend on the program, the input mechanism it uses and the default encoding used by the Operating System. For example, the C++ program made the assumption that the original input is ISO-8859-1 encoded, which was wrong.

By the way, character encodings have always been problematic. A great example is a letter from a French student to his Russian friend where the Cyrillic address was incorrectly written as ISO-8859-1 on the envelope, and decoded by the postal employees.

Generic solution needed for decoding Cyrillic string encoded in UTF-8 in C#

Answers (2)

Related Questions