Reputation: 447
I'm reading about UTF-8 encoding, and I don't understand the following passage:
For characters equal to or below 2047 (hex 0x07FF), the UTF-8 representation is spread across two bytes. The first byte will have the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The second byte will have the top bit set and the second bit clear (i.e. 0x80 to 0xBF).
If I'm not mistaken, this means UTF-8 requires two bytes to represent 2048 characters. In other words, we need to choose 2048 candidates out of the 2^16 = 65,536 possible two-byte values to represent each of those characters.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8 representation is spread across two bytes.
Choosing 2048 out of 65,536 would be no big deal. However, UTF-8 explicitly restricts the range of each byte.
According to the following statements, the number of possible values is 30 (0xDF - 0xC2 + 0x01) for the first byte and 64 (0xBF - 0x80 + 0x01) for the second byte.
The first byte will have the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The second byte will have the top bit set and the second bit clear (i.e. 0x80 to 0xBF).
How can 1920 combinations (30 × 64) accommodate 2048 characters?
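Just to show where my numbers come from, here is a quick Python sketch of the arithmetic (nothing UTF-8-specific, just the counts; the variable names are mine):

first_byte_values = 0xDF - 0xC2 + 0x01    # 30 possible first bytes
second_byte_values = 0xBF - 0x80 + 0x01   # 64 possible second bytes
print(first_byte_values * second_byte_values)   # 1920, which is less than 2048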
Upvotes: 1
Views: 1724
Reputation: 189387
Think of it as a container, where the encoding reserves a few bits for its own synchronization, and you get to use the remaining bits.
So for the range in question, the encoding "template" is
110 abcde  10 fghijk
(where I have left a single space to mark the boundary between the template and the value from the code point we want to encode, and two spaces between the actual bytes)
and you get to use the 11 bits abcdefghijk for the value you actually want to transmit.
So for the code point U+07EB you get
0x07 00000111
0xEB 11101011
where the top five zero bits are masked out (remember, we only get 11 bits, because the maximum value that the encoding can accommodate in two bytes is 0x07FF; if you have a larger value, the encoding will use a different template, which is three bytes long) and so
0x07 = _____ 111 (template: _____ abc)
0xEB = 11 101011 (template: de fghijk)
abc de = 111 11 (where the first three come from 0x07, and the next two from 0xEB)
fghijk = 101011 (the remaining bits from 0xEB)
yielding the value
110 11111  10 101011
aka 0xDF 0xAB.
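If you want to check this mechanically, here is a minimal Python sketch (just an illustration; the helper name encode_two_byte is mine, not from the spec) that applies the two-byte template by hand and compares the result with Python's built-in UTF-8 encoder:

def encode_two_byte(cp):
    # only valid for code points in the two-byte range 0x80..0x7FF
    assert 0x80 <= cp <= 0x7FF
    first = 0xC0 | (cp >> 6)       # 110 followed by the top five bits (abcde)
    second = 0x80 | (cp & 0x3F)    # 10 followed by the bottom six bits (fghijk)
    return bytes([first, second])

print(encode_two_byte(0x07EB).hex())       # dfab
print(chr(0x07EB).encode('utf-8').hex())   # dfab, same result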
Wikipedia's article on UTF-8 contains more examples with nicely colored numbers to see what comes from where.
The range 0x00-0x7F, which can be represented in a single byte, contains 128 code points; the two-byte range thus needs to accommodate 1920 = 2048-128 code points.
The raw encoding would allow values in the range 0xC0-0xDF in the first byte, but the values 0xC0 and 0xC1 are never needed, because those would represent code points which can be represented in a single byte, and thus they are invalid as per the encoding spec. In other words, the 0x02 in 0xC2 comes from the fact that at least one of the high four bits out of the 11 that this segment of the encoding can represent (one of abcd) needs to be a one bit in order for the value to require two bytes.
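As a quick Python illustration of both points (a sketch of my own, not part of the encoding spec): every code point in the two-byte range produces a lead byte between 0xC2 and 0xDF, and a sequence starting with 0xC0 is rejected as an overlong encoding:

lead_bytes = {chr(cp).encode('utf-8')[0] for cp in range(0x80, 0x800)}
print(len(range(0x80, 0x800)))                     # 1920 code points to cover
print(hex(min(lead_bytes)), hex(max(lead_bytes)))  # 0xc2 0xdf

try:
    b'\xc0\x80'.decode('utf-8')    # would be an overlong encoding of U+0000
except UnicodeDecodeError as err:
    print(err)                     # ... invalid start byte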
Upvotes: 3
Reputation: 596206
As you already know, 2047 (0x07FF) contains the raw bits
00000111 11111111
If you look at the bit distribution chart for UTF-8, you will see that 0x07FF falls in the second line (the two-byte range, U+0080 to U+07FF), so it is encoded as 2 bytes using this bit pattern:
110xxxxx 10xxxxxx
Substitute the raw bits into the x's and you get this result:
11011111 10111111 (0xDF 0xBF)
Which is exactly as the description you quoted says:
The first byte will have the two high bits set and the third bit clear (11011111). The second byte will have the top bit set and the second bit clear (10111111).
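If you want to double-check that boundary value, a couple of lines of Python (just a sanity check, not part of the quoted description) reproduce the same bytes:

print(chr(0x07FF).encode('utf-8'))   # b'\xdf\xbf', the last two-byte code point
print(chr(0x0800).encode('utf-8'))   # b'\xe0\xa0\x80', the next one needs three bytes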
Upvotes: 4