Reputation: 3
While studying Unicode and the UTF-8 encoding,
I noticed that the 129th Unicode code point (U+0080) is encoded by UTF-8 starting with 0xc2.
I checked through the leading bytes up to 0xcf,
and no Unicode code point is encoded with a first byte of 0xc1.
Why does the 129th code point start at 0xc2 instead of 0xc1?
Upvotes: 0
Views: 1619
Reputation: 52538
A UTF-8 sequence starting with 0xc1 would encode a Unicode code point in the range 0x40 to 0x7f, and one starting with 0xc0 would encode a code point in the range 0x00 to 0x3f.
There is an iron rule that every code point is represented in UTF-8 in the shortest possible way. Since all these code points can be stored in a single UTF-8 byte, they are not allowed to be stored using two bytes.
For the same reason you will find that there are no 4-byte codes starting with 0xf0 0x80 to 0xf0 0x8f, because the code points they would represent are stored using fewer bytes instead.
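As an illustration (a minimal Python sketch, not part of the answer above), a strict decoder such as Python's built-in UTF-8 codec rejects these overlong forms outright:

    # Minimal sketch (my own illustration): a strict UTF-8 decoder rejects
    # overlong forms, so these byte sequences never appear in valid UTF-8.

    # C1 BF would be an overlong two-byte encoding of U+007F.
    try:
        b"\xc1\xbf".decode("utf-8")
    except UnicodeDecodeError as e:
        print("C1 BF rejected:", e.reason)

    # F0 80 80 8A would be an overlong four-byte encoding of U+000A.
    try:
        b"\xf0\x80\x80\x8a".decode("utf-8")
    except UnicodeDecodeError as e:
        print("F0 80 80 8A rejected:", e.reason)

    # The shortest forms are the valid ones.
    print(b"\x7f".decode("utf-8") == "\u007f")  # True
    print(b"\x0a".decode("utf-8") == "\n")      # True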
Upvotes: 3
Reputation: 177675
The UTF-8 specification, RFC 3629, specifically states in the introduction:
The octet values C0, C1, F5 to FF never appear.
The reason for this is that a 1-byte UTF-8 sequence consists of the 8-bit binary pattern 0xxxxxxx
(a zero followed by seven bits) and can represent Unicode code points that fit in seven bits (U+0000 to U+007F).
A 2-byte UTF-8 sequence consists of the 16-bit binary pattern 110xxxxx 10xxxxxx
and can represent Unicode code points that fit in eight to eleven bits (U+0080 to U+07FF).
It is not legal in UTF-8 encoding to use more bytes than the minimum required, so while U+007F could be represented in two bytes as 11000001 10111111 (C1 BF hex), the specification requires the more compact 1-byte form 01111111.
The first valid two-byte value is the encoding of U+0080, which is 11000010 10000000 (C2 80 hex), so C0 and C1 will never appear.
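For a concrete check (my own Python sketch, not part of the original answer), the boundary and the resulting range of lead bytes can be observed with the built-in encoder:

    # A small check of the boundary between the 1-byte and 2-byte forms,
    # using Python's built-in UTF-8 encoder.
    print('\u007f'.encode('utf-8').hex())   # 7f   -> last 1-byte code point
    print('\u0080'.encode('utf-8').hex())   # c280 -> first 2-byte code point

    # Every 2-byte encoding (U+0080..U+07FF) starts with a byte in C2..DF,
    # so C0 and C1 never occur as lead bytes.
    lead_bytes = {chr(cp).encode('utf-8')[0] for cp in range(0x80, 0x800)}
    print(hex(min(lead_bytes)), hex(max(lead_bytes)))  # 0xc2 0xdf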
See section 3, UTF-8 definition, in the standard. The last paragraph states:
Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000....
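As a quick illustration (again my own sketch, assuming Python's built-in codec as the decoder), the overlong C0 80 is indeed refused:

    # Python's decoder is one implementation that enforces this rule:
    # the overlong sequence C0 80 is rejected rather than decoded as U+0000.
    try:
        b"\xc0\x80".decode("utf-8")
    except UnicodeDecodeError as e:
        print("rejected:", e.reason)            # invalid start byte

    # Only the shortest form, the single byte 00, decodes to U+0000.
    print(b"\x00".decode("utf-8") == "\u0000")  # True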
Upvotes: 5