Reputation: 774
Why does the following code output a hex string which differs from the contents of the file when viewed in a hex editor?
Console.Write(String.Concat(TheUTF7String.Select(c => ((int)c).ToString("x2"))).Substring(0, 40));
using (StreamWriter outfile = new StreamWriter("C:\\test", true))
{
outfile.Write(TheUTF7String);
}
Console Output
1f8b0800000000000003c57d6b931cc5b1e867eb
File Contents (First 32 Bytes) When Viewed In A Hex Editor
1F C2 8B 08 00 00 00 00 00 00 03 C3 85 7D 6B C2 93 1C C3 85 C2 B1 C3 A8 67 C3 AB 57 34 C3 A3 C2
To Address Phoog's Answer:
No, it doesn't look like any character from TheUTF7String is being output as more than two hex characters:
var hexPairs = TheUTF7String.Select(c => ((int)c).ToString("x2")).ToArray();
for (int i = 0; i < 20; i++)
    Console.Write(hexPairs[i] + " ");
Outputs: 1f 8b 08 00 00 00 00 00 00 03 c5 7d 6b 93 1c c5 b1 e8 67 eb
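For comparison, here is one way (assuming the same C:\test path as above) to dump the file's actual bytes rather than the string's characters:

byte[] fileBytes = File.ReadAllBytes("C:\\test");
Console.Write(BitConverter.ToString(fileBytes, 0, 20)); // prints 1F-C2-8B-08-..., matching the hex editor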
Upvotes: 0
Views: 581
Reputation: 941218
"TheUTF7String"
Not really, it is binary data: "▼ ♥Å}k?∟űègë"
Binary data must be stored in a byte[]. It cannot be stored in a System.String: Unicode normalization will randomly destroy the data, and your program will randomly crash when the binary data happens to match one of the surrogate values.
"Why is StreamWriter not behaving as expected"
Binary data must be written with a FileStream. StreamWriter cannot write binary data, only text; it will destroy binary data when it encodes the string, UTF-8 in your case (the default), producing the extra bytes.
The first quote is the most important one: this went off the rails when you assumed you could store the data in a string. StreamWriter was the fairly inevitable next mistake. You must use a byte[] instead, which probably means fixing whatever code obtains the data.
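A minimal sketch of that fix, assuming the bytes arrive from some source (the source path here is purely illustrative):

byte[] data = File.ReadAllBytes("C:\\source.bin"); // keep the payload as byte[], never a string
using (FileStream outfile = new FileStream("C:\\test", FileMode.Append))
{
    outfile.Write(data, 0, data.Length); // the bytes land in the file verbatim, with no encoding step
}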
Upvotes: 2
Reputation: 43036
The facile answer is "because your expectations are wrong." More helpfully, I hope:
Despite the name of your string, it is a UTF-16 string (sort of). All .NET strings are encoded this way in memory.
The default encoding for the stream writer is UTF-8, so that's what you're getting in the file.
Your buffer has the UTF-7 data. When you call Encoding.UTF7.GetString(buffer, 0, size), you get the in-memory UTF-16 representation of the same character sequence. When you write to the StreamWriter, it calls Encoding.GetBytes to convert the string to the bytes it writes to your file. Since it's using UTF-8 as its default encoding, you get UTF-8 data in the file.
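In code, that chain looks like this (a sketch reusing the buffer and size from your call):

string s = Encoding.UTF7.GetString(buffer, 0, size); // decode: raw bytes -> in-memory UTF-16 string
byte[] written = Encoding.UTF8.GetBytes(s);          // what StreamWriter effectively does on Write
// written begins 1F C2 8B 08 ...: not the same bytes you started with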
For any values in the range 128-255 (\u0080 to \u00ff), the UTF-16 character will convert to a two-digit hex code, but the UTF-8 sequence for that character will have two bytes. This explains the difference between your console output and the hex editor.
The character 8B is represented in UTF-8 as C2 8B; in UTF-16 it is 8B 00 (because the Intel chip is little endian), and when converted to int and then to a hex string it is, of course, "8B". The UTF-7 representation seems to be 2B 41 49 73 2D.
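A short sketch that makes the comparison concrete for that one character:

char c = '\u008b';
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(c.ToString())));    // C2-8B
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(c.ToString()))); // 8B-00 (little endian)
Console.WriteLine(BitConverter.ToString(Encoding.UTF7.GetBytes(c.ToString())));    // 2B-41-49-73-2D, i.e. "+AIs-"
Console.WriteLine(((int)c).ToString("x2"));                                        // 8b, as in your console dump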
If you pass Encoding.Unicode to the StreamWriter, you should get the same as the console output in your hex editor, except you'll have extra 00 bytes, since A is represented as 41 00 in memory, but when you convert it to int and call ToString("x2"), you get "41" without the "00".
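For example (a sketch reusing the path and append flag from your question):

using (StreamWriter outfile = new StreamWriter("C:\\test", true, Encoding.Unicode))
{
    outfile.Write(TheUTF7String); // writes 1F 00 8B 00 08 00 ... (UTF-16LE; a new file also gets a FF FE byte-order mark)
}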
EDIT:
I just thought of another way of looking at it. The GetString method decodes a byte sequence, returning the corresponding string, while the GetBytes method encodes a string into a corresponding byte sequence. You can ignore the in-memory representation of the string. (However, for your diagnostic console output, you need to keep in mind that a string is a sequence of characters, while a byte array is a sequence of, well, bytes.)
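For the diagnostic output, that distinction looks like this (a sketch; the first line walks the string's characters, the second walks the bytes of one particular encoding of it):

Console.WriteLine(String.Concat(TheUTF7String.Select(ch => ((int)ch).ToString("x2")))); // one hex pair per char
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(TheUTF7String)));        // one hex pair per byte; longer, with C2/C3 lead bytes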
Upvotes: 2