Cheng Chen

Reputation: 43503

Text file encoding issue

I found some questions on encoding issues before asking, but they don't cover my case. I currently have two methods that I'd rather not modify.

//FileManager.cs
public byte[] LoadFile(string id);
public FileStream LoadFileStream(string id);

They work correctly for all kinds of files. Now I have the ID of a text file (it's guaranteed to be a .txt file) and I want to get its content. I tried the following:

byte[] data = manager.LoadFile(id);
string content = Encoding.UTF8.GetString(data);

But obviously this doesn't work for non-UTF-8 encodings. To resolve the encoding issue, I tried getting the file's FileStream first and then using a StreamReader.

public StreamReader(Stream stream, bool detectEncodingFromByteOrderMarks);

I hoped this overload would resolve the encoding, but I still get strange content.

using(var stream = manager.LoadFileStream(id))
using(var reader = new StreamReader(stream, true))
{
    content = reader.ReadToEnd();    //still incorrect
}

Maybe I misunderstood the usage of detectEncodingFromByteOrderMarks? And how can I resolve the encoding issue?

Upvotes: 2

Views: 333

Answers (1)

C.Evenhuis

Reputation: 26436

Byte order marks (BOMs) are sometimes added to files encoded in one of the Unicode formats to indicate whether the bytes of a multi-byte code unit are stored in big-endian or little-endian order (is byte 1 stored first, and then byte 0? Or byte 0 first, and then byte 1?). This is particularly relevant when files are exchanged between machines with different native byte orders, because they write these multi-byte values in opposite directions.

If you read a file and the first few bytes equal those of a byte order mark, chances are quite high that the file is encoded in the Unicode format that matches that mark. You never know for sure, though, as Shadow Wizard mentioned. Since it's always a guess, detection is offered as an optional parameter (detectEncodingFromByteOrderMarks).
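To make the check concrete, here is a rough sketch of what BOM sniffing could look like on the byte[] returned by LoadFile. BomSniffer, DetectFromBom and Decode are hypothetical names of my own, not part of the .NET API, and the result is still only a guess: for instance, a UTF-16 little-endian file whose first character is U+0000 looks exactly like a UTF-32 BOM.

```csharp
using System;
using System.Text;

static class BomSniffer
{
    // Hypothetical helper: returns the encoding implied by a leading byte order
    // mark, or null when none of the known marks is present.
    public static Encoding DetectFromBom(byte[] data)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));

        // Check the 4-byte UTF-32 marks first: FF FE 00 00 also starts with the UTF-16 LE mark.
        if (data.Length >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
            return new UTF32Encoding(false, true);   // UTF-32, little endian
        if (data.Length >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
            return new UTF32Encoding(true, true);    // UTF-32, big endian
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return new UTF8Encoding(true);           // UTF-8 signature
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return new UnicodeEncoding(false, true); // UTF-16, little endian
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return new UnicodeEncoding(true, true);  // UTF-16, big endian
        return null;
    }

    // Decodes with the detected encoding and skips the BOM bytes; when no BOM
    // is found, the caller's fallback encoding is used on the whole buffer.
    public static string Decode(byte[] data, Encoding fallback)
    {
        Encoding detected = DetectFromBom(data);
        if (detected == null) return fallback.GetString(data);

        int skip = detected.GetPreamble().Length;
        return detected.GetString(data, skip, data.Length - skip);
    }
}
```

With the asker's code this would be called as BomSniffer.Decode(manager.LoadFile(id), someFallback), where someFallback is whatever encoding you expect for files without a BOM.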

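If there is no byte order mark in the first bytes of the file, it'll be hard to guess the file's encoding. In that case the StreamReader(Stream, bool) overload falls back to UTF-8, which would explain the strange content you get for a file saved in another encoding without a BOM.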
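When no BOM is present, one option is to tell StreamReader which encoding to fall back to, using the StreamReader(Stream, Encoding, bool) overload. A minimal sketch follows, assuming you know or can reasonably guess the legacy encoding of your files. TextLoader and ReadAllText are hypothetical names, and ISO-8859-1 in the usage comment is only an example fallback.

```csharp
using System.IO;
using System.Text;

static class TextLoader
{
    // Hypothetical helper: honors a BOM when one is present, otherwise decodes
    // with the caller-supplied fallback instead of the UTF-8 default.
    public static string ReadAllText(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback, true))
        {
            return reader.ReadToEnd();
        }
    }
}

// Possible usage with the methods from the question (the fallback is an assumption):
//   using (var stream = manager.LoadFileStream(id))
//       content = TextLoader.ReadAllText(stream, Encoding.GetEncoding("iso-8859-1"));
```

This does not detect the encoding of BOM-less files. It only replaces the UTF-8 default with a guess that you choose, so it helps only if your files share one known encoding.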

More info: http://en.wikipedia.org/wiki/Byte_order_mark

Upvotes: 1
