Mike Hadlow
Mike Hadlow

Reputation: 9517

How do you read UTF-8 characters from an infinite byte stream - C#

Normally, to read characters from a byte stream you use a StreamReader. In this example I'm reading records delimited by '\r' from an infinite stream.

using(var reader = new StreamReader(stream, Encoding.UTF8))
{
    var messageBuilder = new StringBuilder();
    var nextChar = 'x';
    while (reader.Peek() >= 0)
    {
        nextChar = (char)reader.Read()
        messageBuilder.Append(nextChar);

        if (nextChar == '\r')
        {
            ProcessBuffer(messageBuilder.ToString());
            messageBuilder.Clear();
        }
    }
}

The problem is that the StreamReader has a small internal buffer, so if the code waiting for an 'end of record' delimiter ('\r' in this case) it has to wait until the StreamReader's internal buffer is flushed (usually because more bytes have arrived).

This alternative implementation works for single byte UTF-8 characters, but will fail on multibyte characters.

int byteAsInt = 0;
var messageBuilder = new StringBuilder();
while ((byteAsInt = stream.ReadByte()) != -1)
{
    var nextChar = Encoding.UTF8.GetChars(new[]{(byte) byteAsInt});
    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);

    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}

How can I modify this code so that it works with multi-byte characters?

Upvotes: 6

Views: 5003

Answers (4)

Arno Tolmeijer
Arno Tolmeijer

Reputation: 61

Mike, I found your solution perfect for my situation as well. But I noticed that sometimes it takes four GetChar() calls to determine the characters to be returned. This meant that charCount was 2, while my nextChar buffer size was 1. So I got error "The output character buffer is too small to contain the decoded characters, encoding Unicode fallback System.Text.DecoderReplacementFallback."

I changed my code to:

    // ...
    var nextChar = new char[4];  // 2 might suffice

    for (var i = startPos; i < bytesRead; i++)
    {
        int charCount;
        //...
        charCount = decoder.GetChars(buffer, i, 1, nextChar, 0);

        if (charCount == 0)
        {
            bytesSkipped++;
            continue;
        }

        for (int ic = 0; ic < charCount; ic++)
        {
            char c = nextChar[ic];
            charPos++;

            // Process character here...
        }
    }

Upvotes: 1

phoog
phoog

Reputation: 43036

I don't understand why you're not using the stream reader's ReadLine method. If there's a good reason not to, however, it nonetheless seems to me that repeatedly calling GetChars on the decoder is inefficient. Why not make use of the fact that the byte representation of '\r' can't be part of a multi-byte sequence? (Bytes in a multi-byte sequence must be greater than 127; that is, they have the highest bit set.)

var messageBuilder = new List<byte>();

int byteAsInt;
while ((byteAsInt = stream.ReadByte()) != -1)
{
    messageBuilder.Add((byte)byteAsInt);

    if (byteAsInt == '\r')
    {
        var messageString = Encoding.UTF8.GetString(messageBuilder.ToArray());
        Console.Write(messageString);
        ProcessBuffer(messageString);
        messageBuilder.Clear();
    }
}

Upvotes: 1

Mike Hadlow
Mike Hadlow

Reputation: 9517

Thanks to Richard, I now have a working infinite stream reader. As he explained, the trick is to use a Decoder instance and call its GetChars method. I've tested it with multi-byte Japanese text and it works fine.

int byteAsInt = 0;
var messageBuilder = new StringBuilder();
var decoder = Encoding.UTF8.GetDecoder();
var nextChar = new char[1];

while ((byteAsInt = stream.ReadByte()) != -1)
{
    var charCount = decoder.GetChars(new[] {(byte) byteAsInt}, 0, 1, nextChar, 0);
    if(charCount == 0) continue;

    Console.Write(nextChar[0]);
    messageBuilder.Append(nextChar);

    if (nextChar[0] == '\r')
    {
        ProcessBuffer(messageBuilder.ToString());
        messageBuilder.Clear();
    }
}

Upvotes: 7

Richard
Richard

Reputation: 108975

Rather than Encoding.UTF8.GetChars which is designed to convert complete buffers, get an instance of Decoder and repeatedly call its member method GetChars this will make use of the Decoder's internal buffer to handle partial multi-byte sequences from the end of one call to the next.

Upvotes: 10

Related Questions