Reputation: 631
I'm trying to read a file as bytes and compare them with the string "\u0019\u0093\r\n\u001a\n". I'm sure the file will always give me byte[] { 0x19, 0x93, 0x0d, 0x0a, 0x1a, 0x0a }.
I tried converting the bytes to a string and comparing it to the string literal, but the comparison is always false.
So I tried converting the string to bytes instead, but that comparison is always false too.
(Using .NET Core 3.0 on Windows 10)
Here is what I tried:
byte[] bytes = new byte[] { 0x19, 0x93, 0x0d, 0x0a, 0x1a, 0x0a };
string s = "\u0019\u0093\r\n\u001a\n";
System.Console.WriteLine(Encoding.Default.GetString(bytes) == s);
System.Console.WriteLine(s.Length);
foreach (var b in Encoding.Default.GetBytes(s))
{
    System.Console.WriteLine("Byte: " + b);
}
System.Console.WriteLine(Encoding.Default.GetString(bytes) == s);
The output is:
False
6
Byte: 25
Byte: 194
Byte: 147
Byte: 13
Byte: 10
Byte: 26
Byte: 10
False
The comparison always returns false. I found that after the conversion from string to bytes I get one extra byte, and I have no idea where the 194 came from. Why does this happen?
I assumed they would be equal after conversion. Is that assumption wrong?
What should I do to get the result I expect?
Upvotes: 1
Views: 1045
Reputation: 70671
The byte at issue in your original data is 0x93, which you expect to decode to the character U+0093.
The problem you are running into is that on .NET Core, Encoding.Default is always UTF-8, on every platform (unlike .NET Framework, where on Windows it was the system's current ANSI code page). In UTF-8, a lone 0x93 is not a valid byte sequence, so when you decode your byte array, the decoder substitutes the replacement character U+FFFD (the default behavior of the .NET decoders for unrecognized input), and the decoded string no longer equals s. In the other direction, the character '\u0093' in your string literal encodes in UTF-8 as the two-byte sequence 0xC2 0x93. That is the 194 followed by 147 you see in your output, and it is why you get seven bytes back instead of six.
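You can see both halves of the mismatch directly. Here is a minimal sketch (assuming .NET Core 3.0 or later, where Encoding.Default reports UTF-8):

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // On .NET Core / .NET 5+, Encoding.Default is UTF-8 on all platforms.
        Console.WriteLine(Encoding.Default.WebName);           // utf-8

        // A lone 0x93 is an invalid UTF-8 sequence, so the decoder
        // substitutes the replacement character U+FFFD.
        string decoded = Encoding.Default.GetString(new byte[] { 0x93 });
        Console.WriteLine((int)decoded[0]);                    // 65533 (0xFFFD)

        // '\u0093' is a two-byte sequence in UTF-8: 0xC2 0x93.
        byte[] encoded = Encoding.Default.GetBytes("\u0093");
        Console.WriteLine(BitConverter.ToString(encoded));     // C2-93
    }
}
```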
If you want the original byte 0x93 to decode to the UTF-16 character with essentially the same value (i.e. 0x0093, a.k.a. '\u0093'), then you need to decode the original bytes using a text encoding in which the byte 0x93 does in fact map to the UTF-16 code point U+0093.
Fortunately, there's a web site that will tell us which encodings include this character, and how each one encodes it: https://www.compart.com/en/unicode/charsets/containing/U+0093
From that table, we can see a large number of encodings where this is the case. (There are also some encodings where the UTF-16 character '\u0093' is encoded as a different value, namely 0x33; obviously, we don't want any of those.) The first encoding in the list, "ISO-8859-1", appears suitable, so let's try using that to decode your bytes:
byte[] bytes = new byte[] { 0x19, 0x93, 0x0d, 0x0a, 0x1a, 0x0a };
string s = "\u0019\u0093\r\n\u001a\n";
Encoding encoding = Encoding.GetEncoding("iso-8859-1");
System.Console.WriteLine(encoding.GetString(bytes) == s);
System.Console.WriteLine(s.Length);
foreach (var b in encoding.GetBytes(s))
{
    System.Console.WriteLine("Byte: " + b);
}
System.Console.WriteLine(encoding.GetString(bytes) == s);
This outputs just what you want:
True
6
Byte: 25
Byte: 147
Byte: 13
Byte: 10
Byte: 26
Byte: 10
True
And the bytes displayed are exactly the bytes in your bytes array, which we can demonstrate by adding this line to the end of your program (it needs using System.Linq;):
System.Console.WriteLine(encoding.GetBytes(s).SequenceEqual(bytes));
That will also print True.
And the moral of the story is: knowing the original encoding of the bytes you're trying to decode is not optional. You must know exactly which encoding was used, because it's just that: an encoding. If you use the wrong encoding, you might as well be trying to decode encrypted data.
Different text encodings are, by definition, different. That means that the bytes in one encoding mean something completely different than they do in some other encoding (sort of…most encodings overlap in the lowest 128 code points, because they are all based on ASCII). You'll just get random results if you use the wrong encoding to decode some bytes (or, as in this case, the decoder will simply not recognize the character and translate it into a placeholder that represents an unrecognized character).
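To make that concrete, here is a small sketch that decodes your exact six bytes under two different encodings and shows that the results differ at the one non-ASCII byte (only standard System.Text APIs are used; ISO-8859-1 is built into .NET Core):

```csharp
using System;
using System.Text;

class Demo
{
    static void Main()
    {
        byte[] bytes = { 0x19, 0x93, 0x0d, 0x0a, 0x1a, 0x0a };

        // ISO-8859-1 maps every byte 0x00-0xFF to the code point with the
        // same numeric value, so 0x93 decodes to '\u0093'.
        string latin1 = Encoding.GetEncoding("iso-8859-1").GetString(bytes);

        // UTF-8 rejects a lone 0x93 and substitutes U+FFFD instead.
        string utf8 = Encoding.UTF8.GetString(bytes);

        Console.WriteLine((int)latin1[1]); // 147   (0x93)
        Console.WriteLine((int)utf8[1]);   // 65533 (0xFFFD)
        Console.WriteLine(latin1 == utf8); // False
    }
}
```

The five ASCII-range bytes decode identically in both cases; only the interpretation of 0x93 differs, which is exactly the "encodings overlap in the lowest 128 code points" point above.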
Upvotes: 1