user2295072

Reputation: 13

Generic solution needed for decoding Cyrillic string encoded in UTF-8 in C#

I am getting ÐиÑилл ÐаÑанник from a C++ component and I need to decode it. The string is always UTF-8 encoded. After much R&D, I figured out the following way to decode it.

string text = Encoding.UTF8.GetString(
    Encoding.GetEncoding("iso-8859-1").GetBytes("ÐиÑилл ÐаÑанник"));

But doesn't this hard-code "iso-8859-1"? What if characters other than Cyrillic come up? I want a generic method for decoding a UTF-8 string.

Thanks in advance.

Upvotes: 1

Views: 3836

Answers (2)

Daniel A.A. Pelsmaeker

Reputation: 50336

When you type text, the computer sees only bytes. In this case, when you type Cyrillic characters into your C++ program, the computer converts each character to its corresponding UTF-8 byte sequence.

string typedByUser = "Привет мир!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);

Then your C++ program comes along, looks at the bytes and thinks it is ISO-8859-1 encoded.

string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ÐÑÐ¸Ð²ÐµÑ Ð¼Ð¸Ñ!

Not much you can do about that. You then receive the wrongly decoded string and have to assume that it consists of UTF-8 bytes that were incorrectly interpreted as ISO-8859-1. In this case the assumption proves to be correct, but there is no way to determine this with certainty.

byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// Привет мир!

Note that ISO-8859-1 is the ISO Western European encoding and has nothing to do with the fact that the original input was Cyrillic. For example, if the input were UTF-8 encoded Japanese, your C++ program would still interpret it as ISO-8859-1:

string typedByUser = "こんにちは、世界!";
byte[] input = Encoding.UTF8.GetBytes(typedByUser);
string cppString = Encoding.GetEncoding("iso-8859-1").GetString(input);
// ããã«ã¡ã¯ãä¸çï¼
byte[] decoded = Encoding.GetEncoding("iso-8859-1").GetBytes(cppString);
string text = Encoding.UTF8.GetString(decoded);
// こんにちは、世界!

The C++ program will always interpret the input as ISO-8859-1, regardless of whether it is Cyrillic, Japanese or plain English. So that assumption is always correct.
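Because the round trip does not depend on the script, it can be wrapped in one helper. The following is a minimal sketch, not a definitive implementation: the class name `MojibakeRepair` and the methods `Repair` and `Garble` are hypothetical, and it assumes the C++ component really does hand over UTF-8 bytes that were decoded as ISO-8859-1.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // ISO-8859-1 maps every byte 0x00-0xFF to the code point with the same value,
    // so re-encoding the garbled string with it recovers the original bytes.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Hypothetical helper: undoes "UTF-8 bytes decoded as ISO-8859-1".
    public static string Repair(string garbled)
    {
        byte[] originalBytes = Latin1.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }

    // Simulates what the C++ component is assumed to do; useful for testing only.
    public static string Garble(string text)
    {
        return Latin1.GetString(Encoding.UTF8.GetBytes(text));
    }
}
```

Calling `MojibakeRepair.Repair(MojibakeRepair.Garble(s))` should give back `s` for Cyrillic, Japanese or plain English input alike, since nothing in the helper is specific to one script.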

However, you have an additional assumption that the original input is UTF-8 encoded. I'm not sure whether that is always correct. It may depend on the program, the input mechanism it uses and the default encoding used by the operating system. For example, the C++ program assumed that the original input was ISO-8859-1 encoded, which was wrong.
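If the UTF-8 assumption may not always hold, one option is to make the repair fail loudly rather than produce replacement characters. This is a hedged sketch of a heuristic, not a reliable detector: the class `StrictMojibakeRepair` and the method `TryRepair` are hypothetical names, and a string that merely happens to look like valid UTF-8 once re-encoded would still be "repaired".

```csharp
using System;
using System.Text;

public static class StrictMojibakeRepair
{
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // throwOnInvalidBytes: true makes GetString throw on malformed UTF-8
    // instead of silently substituting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: returns false and leaves the input unchanged when the
    // string cannot be UTF-8 that was decoded as ISO-8859-1.
    public static bool TryRepair(string input, out string repaired)
    {
        repaired = input;

        // A character above U+00FF cannot have come from an ISO-8859-1 decode.
        foreach (char c in input)
        {
            if (c > '\u00FF') return false;
        }

        try
        {
            repaired = StrictUtf8.GetString(Latin1.GetBytes(input));
            return true;
        }
        catch (DecoderFallbackException)
        {
            // The recovered bytes are not valid UTF-8, so the assumption was wrong.
            return false;
        }
    }
}
```

A caller could fall back to the unmodified string whenever `TryRepair` returns false, which at least avoids corrupting text that was correctly decoded in the first place.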


By the way, character encodings have always been problematic. A great example is a letter from a French student to his Russian friend where the Cyrillic address was incorrectly written as ISO-8859-1 on the envelope, and decoded by the postal employees.

Upvotes: 3

Michiel Cornille

Reputation: 2097

A source of characters should only be transferred in one encoding: it is either ISO-8859-1 or something else, but not both at the same time. That means you might be wrong about the reverse-engineered Cyrillic in the first place.

Could you post the expected UTF-8 output of your input?

Upvotes: 0
