Toemouse

Reputation: 40

In what circumstances would 32-bits be required in UTF-8 encoding?

From my understanding and what I have been reading around the web, UTF-8 can use 1-4 code units (each a byte in length) to encode all characters from the Unicode character set. What I am wondering is this: since all code points in Unicode can be represented in 21 bits, when would you use 4 code units rather than 3?

Three bytes (24 bits) would be more than enough to represent any Unicode code point, so when would you use 32 bits in UTF-8 encoding, and why? Are the extra bits needed to store additional data of some kind?
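For concreteness, a quick check in Python (using the built-in str.encode, just as an illustration) shows how many code units a few sample code points take:

    # How many UTF-8 code units (bytes) do a few sample code points take?
    for ch in ("A", "€", "𐍈", "😀"):      # U+0041, U+20AC, U+10348, U+1F600
        print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s)")

Everything above U+FFFF already comes out as four bytes, which is the case I am asking about.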

Upvotes: 0

Views: 344

Answers (1)

Mark Tolonen

Reputation: 178115

The UTF-8 encoding has overhead. The first byte uses 1-5 bits to indicate how many additional bytes follow, and each additional byte uses 2 bits as a continuation-byte marker. Thus a four-byte UTF-8 sequence spends 5 overhead bits in the first byte and 2 overhead bits in each of the remaining 3 bytes, leaving 32 - 5 - 3*2 = 21 bits to encode the code point.

1-byte UTF-8, 7 data bits (U+0000 to U+007F): 0xxxxxxx
2-byte UTF-8, 11 data bits (U+0080 to U+07FF): 110xxxxx 10xxxxxx
3-byte UTF-8, 16 data bits (U+0800 to U+FFFF): 1110xxxx 10xxxxxx 10xxxxxx
4-byte UTF-8, 21 data bits (U+10000 to U+10FFFF): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Ref: UTF-8
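To make the bit layout concrete, here is a minimal sketch in Python (my own illustration, with a hypothetical helper utf8_encode) that builds the byte sequences by hand from the patterns above and checks them against the built-in encoder:

    def utf8_encode(cp):
        # Hand-encode one code point into UTF-8 bytes, following the
        # bit patterns shown above. (Simplified: surrogate code points
        # U+D800-U+DFFF are not rejected here.)
        if cp <= 0x7F:                      # 7 data bits  -> 1 byte
            return bytes([cp])
        if cp <= 0x7FF:                     # 11 data bits -> 2 bytes
            return bytes([0xC0 | (cp >> 6),
                          0x80 | (cp & 0x3F)])
        if cp <= 0xFFFF:                    # 16 data bits -> 3 bytes
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        if cp <= 0x10FFFF:                  # 21 data bits -> 4 bytes
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        raise ValueError("code point out of Unicode range")

    for cp in (0x41, 0x20AC, 0x1F600):
        manual = utf8_encode(cp)
        assert manual == chr(cp).encode("utf-8")   # matches the built-in encoder
        print(f"U+{cp:04X}: {manual.hex(' ')}")

U+1F600, for example, comes out as f0 9f 98 80: 5 + 2 + 2 + 2 = 11 overhead bits plus 21 data bits, even though the code point itself only needs 17 of them.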

Upvotes: 4
