mpen

Reputation: 282805

Why does UTF-8 waste so many bits?

If you look at the table here which shows the byte layout of UTF-8, it looks quite wasteful!

A 6-byte char has 17 hardcoded bits! If we just set the first bit of each byte to 1 to indicate "the next byte is part of the same char", then we would only require 6 bits:

1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 1xxxxxxx 0xxxxxxx

And it would still be backwards-compatible with ASCII! And we wouldn't be capped at 6 bytes either.
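To make the bit accounting concrete, here's a quick tally in Python, assuming the original six-byte form from RFC 2279 where the lead byte is 1111110x and each continuation byte is 10xxxxxx:

# Fixed (non-payload) bits in a six-byte sequence.
utf8_fixed = 7 + 5 * 2      # 1111110x lead byte + five 10xxxxxx bytes = 17
proposed_fixed = 5 * 1 + 1  # one flag bit per byte, 0 on the last byte = 6
print(utf8_fixed, proposed_fixed)  # 17 6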

So why is UTF-8 wasteful? Surely there must be a reason that I'm not seeing. Moreover, it seems like there's enough information in that first byte that we don't even need the 10 header on each of the remaining bytes. We could have done:

0xxxxxxx
10xxxxxx xxxxxxxx
110xxxxx xxxxxxxx xxxxxxxx
1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx
11110xxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
11111xxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx

And that would have worked too, no? Or we could support even more bytes with a different scheme.
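Here is a minimal decoding sketch of that hypothetical length-prefixed scheme in Python (the function name and layout are mine, purely to illustrate the idea): the run of leading 1 bits in the first byte says how many continuation bytes follow, and every other bit, including all bits of the continuation bytes, is payload.

def decode_one(data, pos):
    # Count the leading 1 bits of the first byte; that is the number of
    # continuation bytes (capped at 5, per the table above).
    first = data[pos]
    extra = 0
    mask = 0x80
    while extra < 5 and first & mask:
        extra += 1
        mask >>= 1
    if extra == 5:
        value = first & 0x07        # 11111xxx has no terminating 0 bit
    else:
        value = first & (mask - 1)  # payload bits below the terminating 0
    for b in data[pos + 1 : pos + 1 + extra]:
        value = (value << 8) | b    # continuation bytes are pure payload
    return value, pos + 1 + extra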

Does it have to do with how those single bytes will be displayed if UTF-8 isn't properly supported? What does the 10 prefix afford? And is the trade-off worth it? If I try to render UTF-8-encoded Japanese characters in an ASCII-only program, I'm going to get garbage either way, aren't I?

Upvotes: 9

Views: 587

Answers (1)

Stefan Haustein

Reputation: 18793

The reason for this redundancy is to make UTF-8 self-synchronizing: the symbol stream formed by a portion of one code word, or by the overlapped portion of any two adjacent code words, is not a valid code word. See https://en.wikipedia.org/wiki/Self-synchronizing_code and https://en.wikipedia.org/wiki/UTF-8#History
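A minimal sketch of what that buys you in practice (Python; the helper name is mine): every continuation byte matches 10xxxxxx and no lead byte does, so a reader dropped into the middle of a stream can skip at most a few bytes and then decode cleanly. Under a single-flag-bit scheme like the one in the question, a 1xxxxxxx byte could be either a lead byte or a continuation byte, so there is no such guarantee.

def resync(data, pos):
    # Skip continuation bytes (0b10xxxxxx) so decoding can restart at
    # the next lead byte or ASCII byte.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

text = "héllo, 世界".encode("utf-8")
for start in range(len(text) + 1):
    safe = resync(text, start)
    text[safe:].decode("utf-8")  # never raises: the tail is valid UTF-8

The same property limits the damage from a corrupted or dropped byte to the characters it touches; the decoder resynchronizes at the next lead byte.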

Upvotes: 7
