Junekey Jeon

Reputation: 1577

Why does UTF-8 encoding not use bytes of the form 11111xxx as the first byte?

According to https://en.wikipedia.org/wiki/UTF-8, the first byte of a character's encoding never starts with the bit pattern 10xxxxxx or 11111xxx. The reason for the first one is obvious: self-synchronization. But what about the second? Is it for something like a potential extension to enable 5-byte encodings?

Upvotes: 4

Views: 518

Answers (1)

Rob Napier

Reputation: 299345

Older versions of UTF-8 allowed up to 6-byte encodings. It was later restricted to 4-byte encodings, but there's no reason to make the format inconsistent in order to achieve that restriction. The number of leading 1s indicates the length of the sequence, so 11111xxx still means "at least 5 bytes," there just are no such legal sequences.

Having illegal byte sequences is very useful in detecting corruption (or, more commonly, attempts to decode data that is not actually UTF-8). So making the format inconsistent just to get back one bit of storage (which couldn't actually be used for anything) would hurt other goals.
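To illustrate the lead-byte rule described above, here is a minimal sketch in C (the function name and sample bytes are just illustrative, not from any particular library). It counts the leading 1 bits of a candidate first byte to get the sequence length, and rejects continuation bytes (10xxxxxx) as well as the 11111xxx range, which would imply a 5- or 6-byte sequence that modern UTF-8 no longer allows:

```c
#include <stdio.h>

/* Return the expected sequence length (1-4) implied by a UTF-8 lead byte,
 * or -1 for bytes that can never start a sequence: continuation bytes
 * (10xxxxxx) and the 11111xxx range (0xF8-0xFF). */
int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx: single-byte (ASCII) */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: 2-byte sequence */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: 3-byte sequence */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx: 4-byte sequence */
    return -1;                            /* 10xxxxxx or 11111xxx: invalid lead byte */
}

int main(void)
{
    unsigned char samples[] = { 0x41, 0xC3, 0xE2, 0xF0, 0x80, 0xF8, 0xFF };
    for (size_t i = 0; i < sizeof samples; i++)
        printf("0x%02X -> %d\n", samples[i], utf8_sequence_length(samples[i]));
    return 0;
}
```

This sketch only classifies the lead byte; a real decoder would also check the continuation bytes and reject overlong or out-of-range encodings. The point is that a decoder seeing 0xF8-0xFF can immediately conclude the input is not valid UTF-8.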

Upvotes: 7
