joelc

Reputation: 2751

Reading a stream that may have non-ASCII characters

I have an application that reads string data in from a stream. The string data is typically in English but on occasion it encounters something like 'Jalapeño' and the 'ñ' comes out as '?'. In my implementation I'd prefer to read the stream contents into a byte array but I could get by reading the contents into a string. Any idea what I can do to make this work right?

Current code is as follows:

byte[] data = new byte[len];  // len is known a priori
byte[] temp = new byte[2];
StreamReader sr = new StreamReader(input_stream);
int position = 0;
while (!sr.EndOfStream)
{
  int c = sr.Read();
  temp = System.BitConverter.GetBytes(c);
  data[position] = temp[0];
  position++;
}
input_stream.Close();
sr.Close();

Upvotes: 3

Views: 4385

Answers (2)

bloudraak

Reputation: 6002

You can pass the encoding to the StreamReader as in:

StreamReader sr = new StreamReader(input_stream, Encoding.UTF8);

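However, according to the documentation, StreamReader already uses UTF-8 by default when no encoding is specified, so this only changes anything if the stream is actually in a different encoding, in which case you should pass that encoding instead.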
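To illustrate why the encoding passed to the reader matters, here is a minimal sketch. It assumes, purely for illustration, that the source wrote the text as ISO-8859-1 (Latin-1); the class and method names are hypothetical and the real stream's encoding may well be something else.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: decodes raw bytes through a StreamReader
    // using the given encoding.
    public static string Decode(byte[] raw, Encoding encoding)
    {
        using (var reader = new StreamReader(new MemoryStream(raw), encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Assumption: the producer encoded the text as ISO-8859-1.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] raw = latin1.GetBytes("Jalape\u00F1o");

        // Read as UTF-8, the lone 0xF1 byte is invalid and is replaced
        // with U+FFFD, which usually displays as '?' or a box.
        string wrong = Decode(raw, Encoding.UTF8);
        Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0);

        // Read with the matching encoding, the text survives intact.
        string right = Decode(raw, latin1);
        Console.WriteLine(right == "Jalape\u00F1o");
    }
}
```

If the 'ñ' still comes out wrong with Encoding.UTF8, it is worth checking which encoding the producer of the stream really uses.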

Update

The following reads 'Jalapeño' fine:

byte[] bytes;
using (var stream = new FileStream("input.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
{
    var index = 0;
    var count = (int) stream.Length;
    bytes = new byte[count];
    while (count > 0)
    {
        int n = stream.Read(bytes, index, count);
        if (n == 0)
            throw new EndOfStreamException();

        index += n;
        count -= n;
    }
}

// test
string s = Encoding.UTF8.GetString(bytes);
Console.WriteLine(s);

As does this:

byte[] bytes;
using (var stream = new FileStream("input.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
{
    var reader = new StreamReader(stream);
    string text = reader.ReadToEnd();
    bytes = Encoding.UTF8.GetBytes(text);
}

// test
string s = Encoding.UTF8.GetString(bytes);
Console.WriteLine(s);

From what I understand, the 'ñ' character is stored as the two bytes 0xC3 0xB1 when the text is saved with UTF-8 encoding. If you keep only one byte per character, you'll lose data.

I'd suggest reading the whole stream into a byte array (the first example) and then decoding it. Or use StreamReader to do the work for you.
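As a quick check of the byte counts involved, the following sketch prints the UTF-8 bytes of 'ñ' and the encoded length of the whole word. The class and method names are made up for the example.

```csharp
using System;
using System.Text;

public static class Utf8Bytes
{
    // Hypothetical helper: the UTF-8 bytes of the text as a hex string.
    public static string Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    public static void Main()
    {
        // 'ñ' (U+00F1) takes two bytes in UTF-8.
        Console.WriteLine(Hex("\u00F1")); // C3-B1

        // Eight characters, but nine bytes once encoded.
        Console.WriteLine(Encoding.UTF8.GetByteCount("Jalape\u00F1o")); // 9
    }
}
```

So a buffer sized by character count, or a loop that stores one byte per character, cannot hold the UTF-8 form of the text.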

Upvotes: 4

Marc Gravell

Reputation: 1062502

Since you're trying to fill the contents into a byte-array, don't bother with the reader - it isn't helping you. Use just the stream:

byte[] data = new byte[len];
int read, offset = 0;
while(len > 0 &&
    (read = input_stream.Read(data, offset, len)) > 0)
{
    len -= read;
    offset += read;
}
if(len != 0) throw new EndOfStreamException();
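The same loop can be wrapped in a small helper and followed by an explicit decode step if a string is needed later. This is only a sketch: ReadExact is a hypothetical name, a MemoryStream stands in for input_stream, and UTF-8 is assumed as the encoding.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamBytes
{
    // Hypothetical helper: reads exactly 'count' bytes or throws if the
    // stream ends early. Stream.Read may return fewer bytes than asked
    // for, hence the loop.
    public static byte[] ReadExact(Stream input, int count)
    {
        var data = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = input.Read(data, offset, count - offset);
            if (read == 0) throw new EndOfStreamException();
            offset += read;
        }
        return data;
    }

    public static void Main()
    {
        // Stand-in for input_stream; assumes the data is UTF-8.
        byte[] source = Encoding.UTF8.GetBytes("Jalape\u00F1o");
        using (var input = new MemoryStream(source))
        {
            byte[] data = ReadExact(input, source.Length);

            // Decode only when text is needed, with a known encoding.
            string text = Encoding.UTF8.GetString(data);
            Console.WriteLine(text == "Jalape\u00F1o");
        }
    }
}
```

On .NET 7 and later, Stream.ReadExactly covers the same ground without a hand-written loop.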

Upvotes: 1
