Reputation: 15058
UTF-8 can represent each character with one or more bytes. Suppose I have the following byte sequence:
48 65
How can I know whether it is one character represented by 48 and another character represented by 65, or ONE character represented by a combination of the TWO bytes 48 65?
Upvotes: 6
Views: 2254
Reputation: 26415
UTF-8 was designed to be unambiguous. Neither 0x48 nor 0x65, nor any other byte below 0x80, is ever part of a multi-byte sequence, so 48 65 is always two separate characters ("H" and "e").
The most significant bits of the first byte of a UTF-8 encoded code point will tell you how many bytes are used for it. This should be clear from the UTF-8 Bit Distribution Table:
Scalar Value                 First Byte  Second Byte  Third Byte  Fourth Byte
00000000 0xxxxxxx            0xxxxxxx
00000yyy yyxxxxxx            110yyyyy    10xxxxxx
zzzzyyyy yyxxxxxx            1110zzzz    10yyyyyy     10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx   11110uuu    10uuzzzz     10yyyyyy    10xxxxxx
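To make the table concrete, here is a minimal Python sketch that reads the leading bits of each first byte and splits a byte string into per-character chunks. The function names `utf8_sequence_length` and `split_characters` are made up for illustration, and the code only looks at the bit patterns in the table; it does not reject overlong or otherwise ill-formed sequences the way a real decoder would.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes the sequence starting with `lead` occupies.

    Returns 0 when `lead` is a continuation byte (10xxxxxx), which cannot
    start a sequence. Illustrative helper, not a full validator.
    """
    if lead < 0x80:   # 0xxxxxxx: single-byte character
        return 1
    if lead < 0xC0:   # 10xxxxxx: continuation byte
        return 0
    if lead < 0xE0:   # 110yyyyy: two-byte sequence
        return 2
    if lead < 0xF0:   # 1110zzzz: three-byte sequence
        return 3
    if lead < 0xF8:   # 11110uuu: four-byte sequence
        return 4
    raise ValueError(f"byte 0x{lead:02X} never appears in UTF-8")


def split_characters(data: bytes) -> list:
    """Split `data` into one bytes object per encoded character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n == 0:
            raise ValueError(f"unexpected continuation byte at offset {i}")
        chars.append(data[i:i + n])
        i += n
    return chars


# The bytes from the question: both are below 0x80, so two characters.
print(split_characters(bytes([0x48, 0x65])))
```

For the bytes in the question this yields two one-byte chunks, because each byte starts with a 0 bit and therefore stands alone.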
So the worst case is that you jump into the middle of a string and land on a byte whose two most significant bits are 10 (anything from 0x80 through 0xBF), which marks it as a continuation byte. In that case, you have to backtrack at most 3 bytes to find the first byte of the sequence.
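The backtracking step can be sketched in a few lines of Python. The helper name `character_start` is hypothetical, and the sketch assumes the input is well-formed UTF-8; on malformed data it simply stops after stepping back 3 bytes.

```python
def character_start(data: bytes, index: int) -> int:
    """Return the offset of the first byte of the character containing data[index].

    Illustrative helper: steps back over continuation bytes (10xxxxxx),
    at most 3 times, since a UTF-8 sequence is at most 4 bytes long.
    """
    start = index
    # (byte & 0xC0) == 0x80 tests for the 10xxxxxx continuation pattern.
    while start > 0 and index - start < 3 and (data[start] & 0xC0) == 0x80:
        start -= 1
    return start


data = "a\u20acb".encode("utf-8")   # bytes 61 E2 82 AC 62
print(character_start(data, 3))     # offset 3 is the last byte of the euro sign
```

Once the start offset is known, the leading bits of that byte give the sequence length, so the whole character can be read forward from there.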
Upvotes: 6