CrazySynthax

Reputation: 15058

UTF-8: How can the reader know how many bytes a character occupies?

UTF-8 represents each character with one or more bytes. Suppose I have the following byte sequence:

48 65

How can I know whether this is one character represented by 48 and another represented by 65, or ONE character represented by the TWO-byte combination 48 65?

Upvotes: 6

Views: 2254

Answers (1)

user3942918

Reputation: 26415

UTF-8 was designed to be unambiguous. Neither 0x48 nor 0x65, nor any other byte below 0x80, is ever part of a multi-byte sequence.

The most significant bits of the first byte of a UTF-8 encoded code point will tell you how many bytes are used for it. This should be clear from the UTF-8 Bit Distribution Table:

Scalar Value                First Byte  Second Byte Third Byte  Fourth Byte
00000000 0xxxxxxx           0xxxxxxx            
00000yyy yyxxxxxx           110yyyyy    10xxxxxx        
zzzzyyyy yyxxxxxx           1110zzzz    10yyyyyy    10xxxxxx    
000uuuuu zzzzyyyy yyxxxxxx  11110uuu    10uuzzzz    10yyyyyy    10xxxxxx
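As a rough illustration of reading the table, here is a minimal Python sketch that takes the first byte of a sequence and returns how many bytes the code point occupies. The function names `utf8_sequence_length` and `split_characters` are made up for this example, and the code only inspects lead bytes; it does not check that the following bytes really are continuation bytes, so treat it as a sketch rather than a validating decoder.

```python
def utf8_sequence_length(lead):
    """Return the sequence length implied by a lead byte (hypothetical helper)."""
    if lead < 0x80:            # 0xxxxxxx: single byte
        return 1
    if 0xC0 <= lead <= 0xDF:   # 110yyyyy: two bytes
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110zzzz: three bytes
        return 3
    if 0xF0 <= lead <= 0xF7:   # 11110uuu: four bytes
        return 4
    # 0x80-0xBF are continuation bytes; 0xF8-0xFF never appear in UTF-8.
    # (0xC0, 0xC1 and 0xF5-0xF7 match a lead pattern above but are also
    # invalid in well-formed UTF-8; this sketch does not reject them.)
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


def split_characters(data):
    """Split UTF-8 bytes into one bytes object per code point (no validation)."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars
```

Applied to the bytes from the question, `split_characters(bytes([0x48, 0x65]))` yields two separate one-byte characters, since both bytes are below 0x80.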

So the worst case is that you jump into the middle of a string and land on a byte whose two most significant bits are 10 (anything from 0x80 through 0xBF), which marks it as a continuation byte. In that case, you would have to backtrack at most 3 bytes to find the lead byte and determine the full sequence.
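The backtracking step could look something like the following Python sketch. The helper names `is_continuation` and `sequence_start` are hypothetical, and the code assumes the input is well-formed UTF-8 apart from the arbitrary starting position.

```python
def is_continuation(byte):
    """True if the byte has the form 10xxxxxx (0x80 through 0xBF)."""
    return (byte & 0xC0) == 0x80


def sequence_start(data, index):
    """Return the index of the lead byte for the sequence containing data[index]."""
    start = index
    steps = 0
    # A sequence has at most 3 continuation bytes, so step back at most 3 times.
    while is_continuation(data[start]) and steps < 3 and start > 0:
        start -= 1
        steps += 1
    if is_continuation(data[start]):
        raise ValueError("no lead byte found within 3 bytes; input is not valid UTF-8")
    return start
```

For example, with `data = "a€b".encode("utf-8")` (bytes 61 E2 82 AC 62), landing on index 3 (the byte AC) backtracks two bytes to index 1, where the lead byte E2 sits.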

Upvotes: 6
