조재현

Reputation: 3

Why is there no UTF-8 sequence starting with 0xC1?

While studying Unicode and the UTF-8 encoding, I noticed that the 129th Unicode code point (U+0080) is encoded in UTF-8 as a sequence starting with 0xc2.

I checked the encodings up to those starting with 0xcf.

No code point was encoded as a sequence starting with 0xc1.

Why does the encoding of the 129th code point start with 0xc2 instead of 0xc1?

Upvotes: 0

Views: 1619

Answers (2)

gnasher729

Reputation: 52538

A two-byte UTF-8 sequence starting with 0xc1 would encode a Unicode code point in the range 0x40 to 0x7f. One starting with 0xc0 would encode a code point in the range 0x00 to 0x3f.

There is an iron rule that every code point is represented in UTF-8 in the shortest possible way. Since all these code points can be stored in a single UTF-8 byte, they are not allowed to be stored using two bytes.

For the same reason, you will find that there are no 4-byte codes starting with 0xf0 0x80 to 0xf0 0x8f, because the code points they would encode (below U+10000) are stored using fewer bytes instead.
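As a quick check of the shortest-form rule, here is a minimal sketch using Python's built-in strict UTF-8 codec. The choice of Python and the helper name `is_valid_utf8` are assumptions made for illustration; they are not part of the answer above.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Overlong two-byte forms of code points below U+0080.
overlong_two_byte = [b"\xc0\x80", b"\xc1\xbf"]
# Overlong four-byte form (F0 8F ...) of U+FFFF.
overlong_four_byte = [b"\xf0\x8f\xbf\xbf"]
# Shortest forms of U+007F, U+0080, U+FFFF and U+10000.
shortest_forms = [b"\x7f", b"\xc2\x80", b"\xef\xbf\xbf", b"\xf0\x90\x80\x80"]

for seq in overlong_two_byte + overlong_four_byte + shortest_forms:
    print(seq.hex(" "), is_valid_utf8(seq))
```

The overlong sequences should be reported as invalid and the shortest forms as valid.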

Upvotes: 3

Mark Tolonen

Reputation: 177675

The UTF-8 specification, RFC 3629, specifically states in the introduction:

The octet values C0, C1, F5 to FF never appear.

The reason for this is that a 1-byte UTF-8 sequence consists of the 8-bit binary pattern 0xxxxxxx (a zero followed by seven bits) and can represent Unicode code points that fit in seven bits (U+0000 to U+007F).

A 2-byte UTF-8 sequence consists of the 16-bit binary pattern 110xxxxx 10xxxxxx and can represent Unicode code points that fit in eight to eleven bits (U+0080 to U+07FF).
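The two-byte bit layout can be sketched in Python as follows. The function name `encode_two_byte` is hypothetical and the code only illustrates the `110xxxxx 10xxxxxx` pattern; it is not a full encoder.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Pack a code point in U+0080..U+07FF into 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not take exactly two bytes")
    lead = 0b11000000 | (code_point >> 6)            # top five payload bits
    continuation = 0b10000000 | (code_point & 0x3F)  # low six payload bits
    return bytes([lead, continuation])


print(encode_two_byte(0x80).hex())   # c280
print(encode_two_byte(0x7FF).hex())  # dfbf
```

Because the smallest permitted input is U+0080, whose top five payload bits are `00010`, the lead byte produced here is never below C2.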

It is not legal in UTF-8 encoding to use more bytes than the minimum required, so while U+007F could in principle be written in two bytes as 11000001 10111111 (C1 BF hex), the specification requires the more compact 1-byte form 01111111 (7F hex).

The first valid two-byte value is the encoding of U+0080, which is 11000010 10000000 (C2 80 hex), so C0 and C1 will never appear.
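One way to confirm which octet values never occur is to encode every Unicode scalar value and collect the bytes that show up. This is a brute-force sketch using Python's built-in codec, not something taken from the RFC; the variable names are arbitrary. It loops over about 1.1 million code points, so it may take a moment to run.

```python
used = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates cannot be encoded in UTF-8
    used.update(chr(cp).encode("utf-8"))

# Byte values that never occur in any valid encoding.
unused = sorted(set(range(256)) - used)
print(" ".join(f"{b:02X}" for b in unused))
# C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF
```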

See Section 3, "UTF-8 definition", in the standard. The last paragraph states:

Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000....
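To illustrate the kind of naive decoding the RFC warns about, here is a small Python sketch limited to two-byte sequences. Both function names are hypothetical, and the strict variant is only one possible guard, not the RFC's reference algorithm.

```python
def decode_two_byte_naive(data: bytes) -> int:
    """Extract the payload bits without checking for overlong forms."""
    lead, continuation = data
    return ((lead & 0x1F) << 6) | (continuation & 0x3F)


def decode_two_byte_strict(data: bytes) -> int:
    """Decode a two-byte sequence, rejecting malformed and overlong input."""
    if len(data) != 2:
        raise ValueError("expected exactly two bytes")
    lead, continuation = data
    if (lead & 0xE0) != 0xC0 or (continuation & 0xC0) != 0x80:
        raise ValueError("not a 110xxxxx 10xxxxxx sequence")
    code_point = ((lead & 0x1F) << 6) | (continuation & 0x3F)
    if code_point < 0x80:
        # Would fit in one byte, so the lead byte was C0 or C1.
        raise ValueError("overlong encoding")
    return code_point


print(hex(decode_two_byte_naive(b"\xc0\x80")))   # 0x0, the overlong NUL
print(hex(decode_two_byte_strict(b"\xc2\x80")))  # 0x80
```

Calling `decode_two_byte_strict(b"\xc0\x80")` raises `ValueError` instead of silently producing U+0000.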

Upvotes: 5
