Matt

Reputation: 774

.NET 4.5 - Why is StreamWriter not behaving as expected with string when writing to file?

Why does the following code output a hex string that differs from the contents of the file when viewed in a hex editor?

// Dump the first 20 characters of the string as hex (40 hex digits).
Console.Write(String.Concat(TheUTF7String.Select(c => ((int)c).ToString("x2"))).Substring(0, 40));

// Append the string to the file.
using (StreamWriter outfile = new StreamWriter("C:\\test", true))
{
    outfile.Write(TheUTF7String);
}

Console Output

1f8b0800000000000003c57d6b931cc5b1e867eb

File Contents (First 32 Bytes) When Viewed In A Hex Editor

1F C2 8B 08 00 00 00 00 00 00 03 C3 85 7D 6B C2 93 1C C3 85 C2 B1 C3 A8 67 C3 AB 57 34 C3 A3 C2

To Address Phoog's Answer:

No, it doesn't look like any character of TheUTF7String is being output as more than two hex digits:

// Convert once, then print the first 20 hex values.
var hex = TheUTF7String.Select(c => ((int)c).ToString("x2")).ToArray();
for (int i = 0; i < 20; i++)
    Console.Write(hex[i] + " ");

Outputs: 1f 8b 08 00 00 00 00 00 00 03 c5 7d 6b 93 1c c5 b1 e8 67 eb

Upvotes: 0

Views: 581

Answers (2)

Hans Passant

Reputation: 941218

It isn't really a UTF-7 string; it is binary data: "▼ ♥Å}k?∟űègë"

Binary data must be stored in a byte[]. It cannot be stored in a System.String: Unicode normalization will randomly destroy the data, and your program will randomly crash when the binary data happens to match one of the surrogate values.

Why is StreamWriter not behaving as expected

Binary data must be written with a FileStream. StreamWriter cannot write binary data, only text; it will destroy binary data when it encodes the string. UTF-8, the default in your case, is what produced the extra bytes.

The first point is the most important one: this went off the rails when you assumed you could store the data in a string. StreamWriter was the fairly inevitable next mistake. You must use a byte[] instead. This probably means that you have to fix whatever code obtains the data.
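For illustration, a minimal sketch of the byte[]/FileStream approach; the file paths and the variable name data are placeholders, and FileMode.Append mirrors the append flag in the question's StreamWriter call:

// requires: using System.IO;
byte[] data = File.ReadAllBytes("C:\\input");  // however the binary data is obtained
using (FileStream outfile = new FileStream("C:\\test", FileMode.Append))
{
    // FileStream writes the bytes verbatim; no text encoding is applied.
    outfile.Write(data, 0, data.Length);
}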

Upvotes: 2

phoog

Reputation: 43036

The facile answer is "because your expectations are wrong." More helpfully, I hope:

Despite the name of your string, it is a UTF-16 string (sort of). All .NET strings are encoded this way in memory.

The default encoding for the stream writer is UTF-8, so that's what you're getting in the file.

Your buffer has the UTF-7 data. When you call Encoding.UTF7.GetString(buffer, 0, size), you get the in-memory UTF-16 representation of the same character sequence. When you write to the StreamWriter, it calls Encoding.GetBytes to convert the string to the bytes it writes in your file. Since it's using UTF-8 as its default encoding, you get UTF-8 data in the file.
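A short sketch of that round trip, using the first few byte values from the question's own output (the variable names here are made up; .NET's UTF-7 decoder passes the invalid high byte 0x8B through as the character U+008B, which matches the question's console dump):

// requires: using System; using System.Text;
byte[] buffer = { 0x1F, 0x8B, 0x08, 0x00 };
string s = Encoding.UTF7.GetString(buffer, 0, buffer.Length); // bytes -> in-memory UTF-16 string
byte[] utf8 = Encoding.UTF8.GetBytes(s);                      // what StreamWriter writes by default
Console.WriteLine(BitConverter.ToString(utf8));               // 1F-C2-8B-08-00

Note how the single byte 0x8B becomes the two bytes C2 8B, exactly as in the hex editor.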

For any values in the range 128-255 (\u0080 to \u00ff), the UTF-16 character will convert to a two-digit hex code, but the UTF-8 sequence for that character will have two bytes. This explains the difference between your console output and the hex editor.

The character 0x8B is represented in UTF-8 as C2 8B; in UTF-16 it is 8B 00 (because Intel chips are little-endian), and when converted to int and then to a hex string it is, of course, "8B". The UTF-7 representation seems to be 2B 41 49 73 2D.
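That last claim can be checked directly; a quick sketch:

byte[] utf7 = Encoding.UTF7.GetBytes("\u008b");
Console.WriteLine(BitConverter.ToString(utf7)); // 2B-41-49-73-2D, i.e. "+AIs-"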

If you pass Encoding.Unicode to the StreamWriter, you should get the same as the console output in your hex editor, except you'll have extra 00 bytes, since A is represented as 41 00 in memory, but when you convert it to int and call ToString("x2"), you get "41" without the "00".
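As a sketch (same placeholder path as the question), that would be:

using (StreamWriter outfile = new StreamWriter("C:\\test", true, Encoding.Unicode))
{
    outfile.Write(TheUTF7String);
}

One caveat: when the target file is new or empty, StreamWriter will also emit the UTF-16LE byte-order mark FF FE at the start.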

EDIT:

I just thought of another way of looking at it. The GetString method decodes a byte sequence, returning the corresponding string, while the GetBytes method encodes a string into a corresponding byte sequence. You can ignore the in-memory representation of the string. (However, for your diagnostic console output, you need to keep in mind that a string is a sequence of characters, while a byte array is a sequence of, well, bytes.)
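In that spirit, a last sketch contrasting the two counts, reusing the buffer and size names from the question:

string s = Encoding.UTF7.GetString(buffer, 0, size);
Console.WriteLine(s.Length);                         // number of characters in the string
Console.WriteLine(Encoding.UTF8.GetBytes(s).Length); // number of bytes once encoded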

Upvotes: 2
