Reputation: 1327

Before reading a file, do I have to check ANSI encoding?

I'm reading some CSV files. The files are really simple: the separator is always ";" and there are no quote characters (", ') or anything like that.

So it's possible to read the file line by line and split the strings. That's working fine. Now people have told me: maybe you should check the encoding of the file; it should always be ANSI, and if it's not, your output may be different or corrupted. So non-ANSI files should be flagged somehow.

I just said okay! But thinking about it: do I really have to check the file's encoding in this case? I changed the encoding of the file to something else and I'm still able to read it without any problems. My code is simple:

using (TextReader reader = new StreamReader(myFileStream))
{
  string line;
  while ((line = reader.ReadLine()) != null)
  {
    // read the line, separate by ';' and other stuff...
  }
}

So again: do I really need to check the files for ANSI encoding? Could somebody give me an example of when I could get into trouble, or when I would get corrupted output after reading a non-ANSI file? Thank you!

Upvotes: 0

Answers (2)

Marc Gravell

Reputation: 1063198

That particular StreamReader constructor will assume that the data is UTF-8 (unless it finds a byte-order mark). UTF-8 is compatible with ASCII, but reading can go wrong if the data uses bytes in the 128-255 range from a single-byte code page (you'll get the wrong characters in your strings, etc.), or could fail completely (garbage output, or an exception if the decoder is set to throw on invalid bytes) if the data is actually something very different like UTF-7, UTF-32, etc.
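To make the 128-255 case concrete, here is a minimal sketch. It assumes ISO-8859-1 as a stand-in for an "ANSI" code page, and the class and method names (`MisreadDemo`, `WriteThenReadWithDefault`) are made up for the illustration:

```csharp
using System;
using System.IO;
using System.Text;

public static class MisreadDemo
{
    // Writes the text as raw bytes (no BOM) in the given encoding, then reads the
    // first line back with new StreamReader(path), which falls back to UTF-8.
    public static string WriteThenReadWithDefault(string path, string text, Encoding writeEncoding)
    {
        File.WriteAllBytes(path, writeEncoding.GetBytes(text));
        using (var reader = new StreamReader(path))
        {
            return reader.ReadLine();
        }
    }

    public static void Main()
    {
        // Assumption: ISO-8859-1 plays the role of the single-byte "ANSI" code page.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string original = "M\u00FCller;K\u00F6ln"; // "Müller;Köln"
        string path = Path.GetTempFileName();
        try
        {
            string line = WriteThenReadWithDefault(path, original, latin1);
            Console.WriteLine(line == original);        // False
            Console.WriteLine(line.Contains("\uFFFD")); // True: umlauts became U+FFFD
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

A file containing only ASCII characters would round-trip unchanged, which is probably why changing the encoding of a simple test file showed no difference.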

In some cases (the minority) you might be able to use the byte-order-mark to detect the encoding, but this is a circular problem: in most cases, if you don't already know the encoding, you can't really detect the encoding (robustly). So a better approach would be: to know the encoding in the first place. Then you can pass in the correct encoding to use via one of the other constructors.
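As a rough sketch of that advice, the code below passes a known encoding to `StreamReader` and also shows one possible way to flag suspicious files: decoding with a strict `UTF8Encoding` that throws on invalid bytes. The helper names (`EncodingCheck`, `IsValidUtf8`, `ReadFirstLine`) are hypothetical, ISO-8859-1 is again only an assumed stand-in for "ANSI", and the validity check is a heuristic, not robust detection:

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingCheck
{
    // True if the bytes contain no invalid UTF-8 sequence. Pure ASCII data
    // always passes, so this cannot tell ASCII and UTF-8 apart.
    public static bool IsValidUtf8(byte[] data)
    {
        var strict = new UTF8Encoding(false, true); // no BOM, throw on invalid bytes
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Reads the first line using an encoding the caller already knows.
    public static string ReadFirstLine(string path, Encoding known)
    {
        using (var reader = new StreamReader(path, known))
        {
            return reader.ReadLine();
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1"); // assumed file encoding
        string original = "M\u00FCller;K\u00F6ln";
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, latin1.GetBytes(original));
            Console.WriteLine(IsValidUtf8(File.ReadAllBytes(path)));   // False
            Console.WriteLine(ReadFirstLine(path, latin1) == original); // True
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

A file that fails the strict check is certainly not UTF-8; a file that passes may still be ASCII or, by coincidence, something else, so knowing the encoding up front remains the safer route.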

Here's an example of it failing:

// we'll write UTF-32, big-endian, without a byte-order-mark
File.WriteAllText("my.txt", "Hello world", new UTF32Encoding(true, false));

// no BOM to detect, so the reader decodes the bytes as UTF-8
using (var reader = new StreamReader("my.txt"))
{
    string s = reader.ReadLine(); // not "Hello world"
}

Upvotes: 3

Tigran

Reputation: 62276

You can read the file as UTF-8, because UTF-8 has a wonderful property: it encodes ASCII characters in 1 byte (as you would expect), but when needed it expands to multiple bytes to support the rest of Unicode.
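A small sketch of that property, using only `Encoding.UTF8` and `Encoding.ASCII` from the framework (the `Utf8Width` class and `Bytes` helper are made up for the illustration):

```csharp
using System;
using System.Linq;
using System.Text;

public static class Utf8Width
{
    // Number of bytes UTF-8 needs to encode the given string.
    public static int Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    public static void Main()
    {
        Console.WriteLine(Bytes("A"));          // 1 (ASCII)
        Console.WriteLine(Bytes("\u00E9"));     // 2 (e with acute accent)
        Console.WriteLine(Bytes("\u20AC"));     // 3 (euro sign)
        Console.WriteLine(Bytes("\U0001F600")); // 4 (outside the Basic Multilingual Plane)

        // For pure ASCII text the ASCII and UTF-8 byte sequences are identical.
        byte[] ascii = Encoding.ASCII.GetBytes("a;b;c");
        byte[] utf8 = Encoding.UTF8.GetBytes("a;b;c");
        Console.WriteLine(ascii.SequenceEqual(utf8)); // True
    }
}
```

This only covers ASCII: single-byte code pages place their accented letters in the 128-255 range, and those bytes are not valid UTF-8 on their own.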

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Upvotes: 1
