Reputation: 736
I'm implementing a text reader that reads from end to start, and I've run into a problem: I can't tell where the first valid char begins if I take a 1 KB buffer from the end of the stream. I'm looking into the Decoder class and the DecoderFallbackBuffer, but I haven't found a way to detect the invalid bytes.
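void Doo() {
var decoder = Encoding.UTF8.GetDecoder(); // assumed: the decoder's declaration was not part of the snippet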
var b = new Span<byte>(Encoding.UTF8.GetBytes("你好啊,兄弟们哼哼哼😂"));
var b3 = b[2..];
WriteLine(Encoding.UTF8.GetString(b3));
WriteLine(decoder.FallbackBuffer.Remaining);
while (decoder.FallbackBuffer.Fallback(b3.ToArray(), 0))
{
var r = decoder.FallbackBuffer.Remaining;
WriteLine(decoder.FallbackBuffer.Remaining);
b3 = b3[1..]; // move forward
}
WriteLine(Encoding.UTF8.GetString(b3));
}
Doo()
�好啊,兄弟们哼哼哼😂
0
1
System.ArgumentException: Recursive fallback not allowed for bytes \xE5 \xA5 \xBD \xE5 \x95 \x8A \xEF \xBC \x8C \xE5 \x85 \x84 \xE5 \xBC \x9F \xE4 \xBB \xAC \xE5 \x93 .... (Parameter 'bytesUnknown')
+ System.Text.DecoderFallbackBuffer.ThrowLastBytesRecursive(byte[])
+ System.Text.DecoderReplacementFallbackBuffer.Fallback(byte[], int)
+ Submission#7.Doo()
Is this possible in C#?
What do I need to do to detect the position of the first valid byte?
Upvotes: 0
Views: 507
Reputation: 42330
You can use the fact that the second and subsequent bytes in a multi-byte UTF-8 sequence all start with the bits 10 (see Wikipedia).
var b = new Span<byte>(Encoding.UTF8.GetBytes("你好啊,兄弟们哼哼哼😂"));
var b3 = b[2..];
while (b3.Length > 0 && (b3[0] & 0xC0) == 0x80)
{
b3 = b3[1..];
}
Console.WriteLine(Encoding.UTF8.GetString(b3));
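If you need the offset rather than the trimmed span, the same check can be wrapped in a small helper. This is only a sketch: the class and method names (Utf8Boundary, FindFirstCharStart) are made up for illustration. It also caps the scan at three bytes, on the assumption that the input is otherwise well-formed UTF-8, where a sequence never has more than three continuation bytes.

```csharp
using System;
using System.Text;

static class Utf8Boundary
{
    // Hypothetical helper: returns the offset of the first byte that is not
    // a UTF-8 continuation byte (bit pattern 10xxxxxx).
    // Well-formed UTF-8 has at most three continuation bytes per character,
    // so the scan gives up after three bytes instead of walking the buffer.
    public static int FindFirstCharStart(ReadOnlySpan<byte> bytes)
    {
        int limit = Math.Min(bytes.Length, 3);
        int pos = 0;
        while (pos < limit && (bytes[pos] & 0xC0) == 0x80)
        {
            pos++;
        }
        return pos;
    }
}
```

You would then decode with Encoding.UTF8.GetString(b3[Utf8Boundary.FindFirstCharStart(b3)..]). If the helper returns 3 and the next byte is still a continuation byte, the data is not valid UTF-8 and you need to decide how to handle it.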
Alternatively, you can use a decoder with a DecoderReplacementFallback to replace all invalid byte sequences with a replacement character, and then strip off the replacement character. Encoding.UTF8 uses a replacement character by default.
If you want to get a string out, just:
var b = new Span<byte>(Encoding.UTF8.GetBytes("你好啊,兄弟们哼哼哼😂"));
var b3 = b[2..];
var result = Encoding.UTF8.GetString(b3).TrimStart('\uFFFD');
Console.WriteLine(result);
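One thing to keep in mind with this approach: TrimStart cannot tell a replacement character produced by the decoder from a genuine U+FFFD that was already in the text, so it removes both. A small sketch of that behaviour, where TrimCaveat and DecodeAndTrim are hypothetical names used only for illustration:

```csharp
using System;
using System.Text;

static class TrimCaveat
{
    // Hypothetical wrapper around the one-liner above: decode with the
    // default replacement fallback, then strip leading U+FFFD characters.
    public static string DecodeAndTrim(ReadOnlySpan<byte> bytes) =>
        Encoding.UTF8.GetString(bytes).TrimStart('\uFFFD');

    // A buffer that legitimately starts with U+FFFD loses that character too.
    public static bool DropsGenuineReplacementChar()
    {
        byte[] data = Encoding.UTF8.GetBytes("\uFFFD\u4F60");
        return DecodeAndTrim(data) == "\u4F60";
    }
}
```

If your input can contain real U+FFFD characters near the chunk boundary, the byte-level check on continuation bytes is probably the safer option.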
If you want to get the index of the first valid character without going via a string, I guess you could do something like (not particularly well-tested):
var b = new Span<byte>(Encoding.UTF8.GetBytes("你好啊,兄弟们哼哼哼😂"));
var b3 = b[2..];
var decoder = Encoding.UTF8.GetDecoder();
Span<char> chars = stackalloc char[1];
int pos = 0;
while (pos < b3.Length)
{
decoder.Convert(b3[pos..], chars, false, out int bytesUsed, out int charsUsed, out bool completed);
if (completed || (charsUsed > 0 && chars[0] != '\uFFFD'))
{
break;
}
pos += bytesUsed;
}
Console.WriteLine(Encoding.UTF8.GetString(b3[pos..]));
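To tie this back to reading a chunk from the end of a stream, here is a rough sketch that uses the continuation-byte check from the first snippet. TailReader and ReadTail are hypothetical names, and the sketch assumes a seekable stream containing well-formed UTF-8. It only handles the last chunk; a full backwards reader would also have to carry the skipped bytes over to the previous chunk.

```csharp
using System;
using System.IO;
using System.Text;

static class TailReader
{
    // Hypothetical helper: decodes up to maxBytes from the end of a seekable
    // stream, dropping a partial character at the start of the chunk.
    public static string ReadTail(Stream stream, int maxBytes)
    {
        long start = Math.Max(0, stream.Length - maxBytes);
        stream.Seek(start, SeekOrigin.Begin);

        var buffer = new byte[(int)(stream.Length - start)];
        int read = 0;
        while (read < buffer.Length)
        {
            int n = stream.Read(buffer, read, buffer.Length - read);
            if (n == 0) break; // stream ended early
            read += n;
        }

        Span<byte> chunk = buffer.AsSpan(0, read);

        // Only a chunk that starts mid-stream can begin inside a character.
        int skip = 0;
        if (start > 0)
        {
            while (skip < chunk.Length && skip < 3 && (chunk[skip] & 0xC0) == 0x80)
            {
                skip++;
            }
        }
        return Encoding.UTF8.GetString(chunk[skip..]);
    }
}
```

For example, with a MemoryStream over the bytes of the test string, TailReader.ReadTail(stream, 1024) returns the whole string, while a smaller maxBytes returns only the characters that fit completely in the chunk.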
Upvotes: 2