Zhenia

Reputation: 4179

Array and String Encoding

When I do

string s = Encoding.Unicode.GetString(a);
byte[] aa = Encoding.Unicode.GetBytes(s);

I get different arrays (a != aa). Why?

But when I do this, it's all right:

string s = Encoding.Default.GetString(a);
byte[] aa = Encoding.Default.GetBytes(s);

Upvotes: 2

Views: 340

Answers (3)

Sebastian Negraszus

Reputation: 12215

To add to Guffa's answer, here is a detailed example of how your code fails for certain byte sequences, such as 0, 216:

// Let's start with some character from the ancient Aegean numbers:
// The code point of Aegean One is U+10107. Code points > U+FFFF need two
// code units with two bytes each if you encode them in UTF-16 (Encoding.Unicode)
string aegeanOne = char.ConvertFromUtf32(0x10107);
byte[] aegeanOneBytes = Encoding.Unicode.GetBytes(aegeanOne);
// Length == 4 (2 bytes each for high and low surrogate)
// == 0, 216, 7, 221

// Let's just take the first two bytes.
// This creates a malformed byte sequence,
// because the corresponding low surrogate is missing.
byte[] a = new byte[2];
a[0] = aegeanOneBytes[0]; // == 0
a[1] = aegeanOneBytes[1]; // == 216

string s = Encoding.Unicode.GetString(a);
// == replacement character � (U+FFFD),
// because the bytes could not be decoded properly (missing low surrogate)

byte[] aa = Encoding.Unicode.GetBytes(s);
// == 253, 255 == 0xFFFD != 0, 216

string s2 = Encoding.Default.GetString(a);
// == "\0Ø" (NUL + LATIN CAPITAL LETTER O WITH STROKE)
// Results may differ, depending on the default encoding of the operating system

byte[] aa2 = Encoding.Default.GetBytes(s2);
// == 0, 216

Upvotes: 2

Mehmet Ataş

Reputation: 11549

It means your byte[] a contains a byte sequence that does not conform to the Unicode (UTF-16) encoding rules.
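
A minimal sketch of how such an invalid sequence can be detected, using a strict decoder fallback (this is not part of the answer above; the sample bytes are the 0, 216 from the question, and Encoding.GetEncoding with fallback arguments is standard .NET):

byte[] a = { 0, 216 }; // a lone high surrogate in little-endian UTF-16

// Ask for UTF-16 with exception fallbacks instead of the default
// replacement-character fallback, so invalid sequences throw.
Encoding strictUtf16 = Encoding.GetEncoding(
    "utf-16",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);

try
{
    string s = strictUtf16.GetString(a); // succeeds only for valid UTF-16
}
catch (DecoderFallbackException)
{
    // the bytes are not a valid UTF-16 sequence
}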

Upvotes: 0

Guffa

Reputation: 700342

That is because you are using the encoding backwards. An encoding is used to encode a string to bytes, and then to decode those bytes back to a string again.

In an encoding every character has a corresponding set of bytes, but not every set of bytes has to have a corresponding character. That's why you can't take arbitrary bytes and decode them into a string.

With Encoding.Default that kind of misuse happens to work, because it is typically a single-byte encoding that has a character for every byte value. It still doesn't make sense to use it that way, though.
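
If the goal is simply to round-trip arbitrary bytes through a string, Base64 is the usual tool rather than a text encoding. A minimal sketch (the sample bytes are arbitrary; Convert.ToBase64String and Convert.FromBase64String are standard .NET helpers):

byte[] a = { 0, 216, 7, 221 }; // arbitrary bytes, not valid UTF-16

string s = Convert.ToBase64String(a);    // == "ANgH3Q=="
byte[] aa = Convert.FromBase64String(s); // == 0, 216, 7, 221

// aa now contains exactly the same bytes as a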

Upvotes: 11
