John

Reputation: 736

Detect invalid bytes in UTF-8

I'm implementing a text reader that reads from the end of a stream to the start, and I face two problems:

  1. The stream can only be read forward.
  2. If I take a 1 KB buffer from the end of the stream, there is no guarantee that its first byte is the start of a character.

I've looked into the Decoder class and DecoderFallbackBuffer, but I didn't find a way to detect the invalid bytes.

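void Doo() {
    // Assumption: 'decoder' was defined earlier in the interactive session; declared here so the snippet compiles.
    var decoder = Encoding.UTF8.GetDecoder();
    var b = new Span<byte>(Encoding.UTF8.GetBytes("你好啊,兄弟们哼哼哼😂"));
    var b3 = b[2..]; // cut into the middle of the first character
    WriteLine(Encoding.UTF8.GetString(b3));
    WriteLine(decoder.FallbackBuffer.Remaining);
    // The second call to Fallback throws the ArgumentException shown below.
    while (decoder.FallbackBuffer.Fallback(b3.ToArray(), 0))
    {
        var r = decoder.FallbackBuffer.Remaining;
        WriteLine(decoder.FallbackBuffer.Remaining);
        b3 = b3[1..]; // move forward
    }
    WriteLine(Encoding.UTF8.GetString(b3));
}
Doo()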
�好啊,兄弟们哼哼哼😂
 0
 1
System.ArgumentException: Recursive fallback not allowed for bytes \xE5 \xA5 \xBD \xE5 \x95 \x8A \xEF \xBC \x8C \xE5 \x85 \x84 \xE5 \xBC \x9F \xE4 \xBB \xAC \xE5 \x93 .... (Parameter 'bytesUnknown')
  + System.Text.DecoderFallbackBuffer.ThrowLastBytesRecursive(byte[])
  + System.Text.DecoderReplacementFallbackBuffer.Fallback(byte[], int)
  + Submission#7.Doo()

Is this possible in C#?
How can I detect the position of the first valid byte?

Upvotes: 0

Views: 507

Answers (1)

canton7

Reputation: 42330

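You can use the fact that the second and subsequent bytes in a multi-byte UTF-8 sequence all start with the bits 10 (see Wikipedia), so any leading byte of the form 10xxxxxx belongs to a character that started before the buffer.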

var b = new Span<byte>(Encoding.UTF8.GetBytes("你好啊,兄弟们哼哼哼😂"));
var b3 = b[2..];

while (b3.Length > 0 && (b3[0] & 0xC0) == 0x80)
{
    b3 = b3[1..];
}

Console.WriteLine(Encoding.UTF8.GetString(b3));

Alternatively, you can use a decoder with a DecoderReplacementFallback, which replaces each invalid byte sequence with a replacement character (U+FFFD), and then strip off the leading replacement characters. Encoding.UTF8 uses this replacement fallback by default.

If you want to get a string out, just:

var b = new Span<byte>(Encoding.UTF8.GetBytes("你好啊,兄弟们哼哼哼😂"));
var b3 = b[2..];

var result = Encoding.UTF8.GetString(b3).TrimStart('\uFFFD');
Console.WriteLine(result);
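
The replacement fallback hides the invalid bytes behind U+FFFD. If the goal is only to find out whether a buffer starts mid-character, a different technique (not the one above) is a strict UTF8Encoding created with throwOnInvalidBytes: true, which throws a DecoderFallbackException instead of substituting. A minimal sketch, where the variable names are just for illustration:

```csharp
using System;
using System.Text;

// Strict UTF-8: invalid bytes raise DecoderFallbackException instead of becoming U+FFFD.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

byte[] bytes = Encoding.UTF8.GetBytes("你好啊,兄弟们哼哼哼😂");
bool threw = false;

try
{
    // Start two bytes in, i.e. in the middle of the first character.
    strictUtf8.GetString(bytes, 2, bytes.Length - 2);
}
catch (DecoderFallbackException ex)
{
    threw = true;
    // BytesUnknown holds the byte sequence the decoder rejected.
    Console.WriteLine($"Invalid bytes: {BitConverter.ToString(ex.BytesUnknown ?? Array.Empty<byte>())}");
}

Console.WriteLine(threw ? "Buffer starts mid-character" : "Buffer starts on a character boundary");
```

This only tells you that something is wrong; for skipping ahead to the next character, the continuation-byte check is simpler and avoids using exceptions for control flow.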

If you want to get the index of the first valid character without going via a string, I guess you could do something like (not particularly well-tested):

var b = new Span<byte>(Encoding.UTF8.GetBytes("你好啊,兄弟们哼哼哼😂"));
var b3 = b[2..];

var decoder = Encoding.UTF8.GetDecoder();
Span<char> chars = stackalloc char[1];
int pos = 0;
while (pos < b3.Length)
{
    decoder.Convert(b3[pos..], chars, false, out int bytesUsed, out int charsUsed, out bool completed);
    if (completed || (charsUsed > 0 && chars[0] != '\uFFFD'))
    {
        break;
    }
    pos += bytesUsed; 
}

Console.WriteLine(Encoding.UTF8.GetString(b3[pos..]));
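
To tie the continuation-byte check back to the original problem, here is a rough sketch of reading a block from the end of a stream and aligning it to a character boundary. It assumes the stream is seekable (CanSeek is true) and that the data is otherwise valid UTF-8. FirstCharBoundary and ReadTail are hypothetical helper names, not framework APIs.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: offset of the first byte that is not a
// UTF-8 continuation byte (bit pattern 10xxxxxx).
int FirstCharBoundary(ReadOnlySpan<byte> buffer)
{
    int pos = 0;
    while (pos < buffer.Length && (buffer[pos] & 0xC0) == 0x80)
    {
        pos++;
    }
    return pos;
}

// Hypothetical helper: reads up to 'count' bytes from the end of a seekable
// stream and decodes them, dropping any partial character at the front.
string ReadTail(Stream stream, int count)
{
    int length = (int)Math.Min(count, stream.Length);
    stream.Seek(-length, SeekOrigin.End);

    byte[] buffer = new byte[length];
    int read = 0;
    while (read < length)
    {
        int n = stream.Read(buffer, read, length - read);
        if (n == 0) break; // end of stream reached early
        read += n;
    }

    var span = buffer.AsSpan(0, read);
    return Encoding.UTF8.GetString(span[FirstCharBoundary(span)..]);
}

string original = "你好啊,兄弟们哼哼哼😂";
byte[] data = Encoding.UTF8.GetBytes(original);
using var ms = new MemoryStream(data);

// Ask for everything except the first two bytes, so the block starts mid-character.
string tail = ReadTail(ms, data.Length - 2);
Console.WriteLine(tail);
```

When reading backward block by block, the bytes skipped at the front of one block belong to the character that ends the previous block, so they would need to be appended to that block rather than discarded.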

Upvotes: 2
