Reputation: 9517
Normally, to read characters from a byte stream you use a StreamReader. In this example I'm reading records delimited by '\r' from an infinite stream.
using (var reader = new StreamReader(stream, Encoding.UTF8))
{
    var messageBuilder = new StringBuilder();
    var nextChar = 'x';
    while (reader.Peek() >= 0)
    {
        nextChar = (char)reader.Read();
        messageBuilder.Append(nextChar);
        if (nextChar == '\r')
        {
            ProcessBuffer(messageBuilder.ToString());
            messageBuilder.Clear();
        }
    }
}
The problem is that the StreamReader has a small internal buffer, so if the code is waiting for an 'end of record' delimiter ('\r' in this case), it has to wait until the StreamReader's internal buffer is flushed (usually because more bytes have arrived).
This alternative implementation works for single-byte UTF-8 characters, but will fail on multi-byte characters.
int byteAsInt = 0;
var messageBuilder = new StringBuilder();
while ((byteAsInt = stream.ReadByte()) != -1)
{
    var nextChar = Encoding.UTF8.GetChars(new[] { (byte)byteAsInt });
    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);
    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}
How can I modify this code so that it works with multi-byte characters?
Upvotes: 6
Views: 5003
Reputation: 61
Mike, I found your solution perfect for my situation as well. But I noticed that sometimes it takes four GetChars() calls (one per byte of a 4-byte UTF-8 sequence) before any characters are returned, and then two characters come back at once. This meant that charCount was 2 while my nextChar buffer size was 1, so I got the error "The output character buffer is too small to contain the decoded characters, encoding Unicode fallback System.Text.DecoderReplacementFallback."
I changed my code to:
// ...
var nextChar = new char[4]; // 2 might suffice
for (var i = startPos; i < bytesRead; i++)
{
    int charCount;
    // ...
    charCount = decoder.GetChars(buffer, i, 1, nextChar, 0);
    if (charCount == 0)
    {
        bytesSkipped++;
        continue;
    }
    for (int ic = 0; ic < charCount; ic++)
    {
        char c = nextChar[ic];
        charPos++;
        // Process character here...
    }
}
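For context, here is a minimal stand-alone sketch (not from the original posts) of why a one-char buffer can fail. It feeds the 4-byte UTF-8 encoding of a character outside the Basic Multilingual Plane (U+1F600 is assumed here purely as an example) to the decoder one byte at a time: the first three calls return 0 chars, and the fourth returns 2 chars (a surrogate pair), so the output buffer needs at least two elements.
using System;
using System.Text;

class DecoderBufferDemo
{
    static void Main()
    {
        // 4-byte UTF-8 sequence for U+1F600: F0 9F 98 80 (assumed example character)
        var bytes = Encoding.UTF8.GetBytes("\U0001F600");
        var decoder = Encoding.UTF8.GetDecoder();
        var nextChar = new char[2]; // a char[1] buffer would throw when the surrogate pair arrives

        foreach (var b in bytes)
        {
            int charCount = decoder.GetChars(new[] { b }, 0, 1, nextChar, 0);
            Console.WriteLine("0x{0:X2} -> {1} char(s)", b, charCount);
        }
    }
}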
Upvotes: 1
Reputation: 43036
I don't understand why you're not using the stream reader's ReadLine method. Even if there's a good reason not to, it seems to me that repeatedly calling GetChars on the decoder is inefficient. Why not make use of the fact that the byte representation of '\r' can't be part of a multi-byte sequence? (Every byte in a multi-byte sequence is greater than 127; that is, it has the highest bit set.)
var messageBuilder = new List<byte>();
int byteAsInt;
while ((byteAsInt = stream.ReadByte()) != -1)
{
    messageBuilder.Add((byte)byteAsInt);
    if (byteAsInt == '\r')
    {
        var messageString = Encoding.UTF8.GetString(messageBuilder.ToArray());
        Console.Write(messageString);
        ProcessBuffer(messageString);
        messageBuilder.Clear();
    }
}
Upvotes: 1
Reputation: 9517
Thanks to Richard, I now have a working infinite stream reader. As he explained, the trick is to use a Decoder instance and call its GetChars method. I've tested it with multi-byte Japanese text and it works fine.
int byteAsInt = 0;
var messageBuilder = new StringBuilder();
var decoder = Encoding.UTF8.GetDecoder();
var nextChar = new char[1];
while ((byteAsInt = stream.ReadByte()) != -1)
{
    var charCount = decoder.GetChars(new[] { (byte)byteAsInt }, 0, 1, nextChar, 0);
    if (charCount == 0) continue;
    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);
    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}
Upvotes: 7
Reputation: 108975
Rather than Encoding.UTF8.GetChars, which is designed to convert complete buffers, get an instance of Decoder and repeatedly call its member method GetChars; this will make use of the Decoder's internal buffer to handle partial multi-byte sequences from the end of one call to the next.
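To illustrate the stateful behaviour described above, here is a minimal sketch (assuming the 3-byte UTF-8 encoding of Japanese あ, U+3042, purely as an example): the first two per-byte calls return 0 chars while the Decoder buffers the partial sequence, and the third call completes it and returns 1 char.
using System;
using System.Text;

class StatefulDecoderDemo
{
    static void Main()
    {
        var bytes = Encoding.UTF8.GetBytes("\u3042"); // E3 81 82 (assumed example character)
        var decoder = Encoding.UTF8.GetDecoder();     // keeps partial sequences between calls
        var chars = new char[2];

        foreach (var b in bytes)
        {
            int charCount = decoder.GetChars(new[] { b }, 0, 1, chars, 0);
            Console.WriteLine("0x{0:X2} -> {1} char(s)", b, charCount);
        }
        // Output: 0xE3 -> 0 char(s), 0x81 -> 0 char(s), 0x82 -> 1 char(s)
    }
}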
Upvotes: 10