graywolf

Reputation: 7470

Why are there no 5-byte and 6-byte sequences in UTF-8?

Why are there no 5-byte or 6-byte sequences? I know they existed until 2003, when they were removed, but I cannot find out why they were removed.

The Wikipedia page on UTF-8 says

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.

but I don't understand why it's important.

Upvotes: 8

Views: 4170

Answers (3)

RARE Kpop Manifesto

Reputation: 2807

UPDATE :

to test whether a code point N lies within the surrogate range, instead of doing 2 relational comparisons, one could also do

 floor( N / 2048 ) == 27

(this is the same boundary arithmetic as the 4^5 * 54 and 4^5 * 55 formulas below, terms re-arranged: 2048 * 27 = 1024 * 54)
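in Python, the single-division test can be cross-checked against the two-comparison version over the whole code space (a quick sketch; `is_surrogate` is just an illustrative name):

```python
# Sketch: the floor-division surrogate test from above,
# cross-checked against the usual pair of relational comparisons.
def is_surrogate(n: int) -> bool:
    """True iff n lies in the UTF-16 surrogate range U+D800..U+DFFF."""
    return n // 2048 == 27  # floor(N / 2048) == 27

# Verify over the entire code space U+0000..U+10FFFF.
assert all(is_surrogate(n) == (0xD800 <= n <= 0xDFFF)
           for n in range(0x110000))
```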

===========================================

right now the space is allocated for 4^8 + 4^10 code points (CP), i.e. 1,114,112, but barely a quarter to a third of that is assigned to anything.

so unless there's a sudden need to add another 750k CPs in a very short span, up to 4 bytes for UTF-8 should be more than enough for years to come.

** just personal preference for

  • 4^8 + 4^10

on top of clarity and simplicity, it also clearly delineates the CPs by UTF-8 byte count :

 4 ^  8 =    65,536 = all CPs for 1-, 2-, or 3-byte UTF-8
 4 ^ 10 = 1,048,576 = all CPs for            4-byte UTF-8

instead of something unseemly like

 2^16 *  17

or worse,

 32^4 + 16^4
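for what it's worth, all three spellings name the same number; a quick Python check:

```python
# Sketch: 4**8 + 4**10, 2**16 * 17, and 32**4 + 16**4 all equal
# the size of the Unicode code space, 1,114,112 (0x110000).
total = 4 ** 8 + 4 ** 10
assert total == 2 ** 16 * 17 == 32 ** 4 + 16 ** 4 == 0x110000
print(total)  # 1114112
```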

*** unrelated sidenote : the cleanest formula-triplet I managed to conjure up for the starting points of the UTF-16 surrogates is :

 4^5 * 54 = 55,296 = 0xD800 = start of high surrogates
 4^5 * 55 = 56,320 = 0xDC00 = start of low surrogates
 4^5 * 56 = 57,344 = 0xE000 = just beyond the upper boundary, 0xDFFF
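the triplet is easy to verify in Python:

```python
# Sketch: the 4**5 * k formulas for the UTF-16 surrogate boundaries.
assert 4 ** 5 * 54 == 55_296 == 0xD800  # start of the high surrogates
assert 4 ** 5 * 55 == 56_320 == 0xDC00  # start of the low surrogates
assert 4 ** 5 * 56 == 57_344 == 0xE000  # first code point past 0xDFFF
```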

Upvotes: 1

Bolpat

Reputation: 1695

I’ve heard some reasons, but didn’t find any of them convincing. Basically, the stupid reason is: UTF-16 was specified before UTF-8, and at that time, 20 bits of storage for characters (yielding 2²⁰ + 2¹⁶ characters, minus a few like non-characters and surrogates for management) were deemed enough.

UTF-8 and UTF-16 are already variable-length encodings that, as you said for UTF-8, could be extended without big hassle (use 5- and 6-byte words). Extending UTF-32 to cover 21 to 31 bits is trivial (32 could be a problem due to signedness), but making it variable-length defeats the use case of UTF-32 completely.

Extending UTF-16 is hard, but I’ll try. Look at what UTF-8 does in a 2-byte sequence: The initial 110yyyyy acts like a high surrogate and 10zzzzzz like a low surrogate. For UTF-16, flip it around and re-use high surrogates as “initial surrogates” and low surrogates as “continue surrogates”. So, basically, you can have multiple low surrogates.
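The 2-byte UTF-8 shape the paragraph describes can be sketched as follows (`utf8_two_byte` is a hypothetical helper for illustration, not a full encoder):

```python
# Sketch: a 2-byte UTF-8 sequence splits a code point into a
# lead byte 110yyyyy and a continuation byte 10zzzzzz.
def utf8_two_byte(cp: int) -> bytes:
    assert 0x80 <= cp <= 0x7FF, "only the 2-byte range"
    lead = 0b1100_0000 | (cp >> 6)           # 110yyyyy: top 5 bits
    cont = 0b1000_0000 | (cp & 0b0011_1111)  # 10zzzzzz: low 6 bits
    return bytes([lead, cont])

assert utf8_two_byte(0xE9) == "é".encode("utf-8")  # U+00E9 → b'\xc3\xa9'
```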

There’s a problem, though: Unicode streams are supposed to resist misinterpretation when you’re “tuning in” or the sender is “tuning out”.

  • In UTF-8, if you read a stream of bytes and it ends with 11100010 10000010, you know for sure the stream is incomplete. 1110 tells you: This is a 3-byte word, but one is still missing. In the suggested “extended UTF-16”, there’s nothing like that.
  • In UTF-16, if you read a stream of bytes and it ends with a high surrogate, you know for sure the stream is incomplete.
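The first point is easy to demonstrate: a decoder rejects the truncated 3-byte sequence outright (a small sketch, using € = E2 82 AC):

```python
# Sketch: a truncated UTF-8 stream is detectable as incomplete.
data = "€".encode("utf-8")   # b'\xe2\x82\xac', a 3-byte sequence
truncated = data[:2]         # ends with 11100010 10000010
try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("incomplete sequence detected")
```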

The “tuning out” can be solved by using U+10FFFE as an announcement for a single UTF-32 encoding. If the stream stops after U+10FFFE, you know you’re missing something; the same goes for an incomplete UTF-32 unit. And if it stops in the middle of the U+10FFFE, it’s lacking a low surrogate. But that does not work because “tuning in” to the UTF-32 encoding can mislead you.

What could be utilized are so-called non-characters (the most well-known would be the reverse of the byte order mark) at the end of plane 16: Encode U+10FFFE and U+10FFFF using existing surrogates to announce a 3- or 4-byte sequence, respectively. This is very wasteful: 32 bits are used for the announcement alone, and 48 or 64 additional bits are used for the actual encoding. However, it is still better than, say, using U+10FFFE and U+10FFFF around a single UTF-32 encoding.

Maybe there’s something flawed in this reasoning. This is an argument of the sort: this is hard, and I’ll prove it by trying and showing where the traps are.

Upvotes: 1

MSalters

Reputation: 179779

Because there are no Unicode characters which would require them. And these cannot be added either because they'd be impossible to encode with UTF-16 surrogates.
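The ceiling this refers to falls out of surrogate-pair arithmetic: each half of a pair carries 10 payload bits, offset by 0x10000, so the largest reachable code point is U+10FFFF. A quick sketch:

```python
# Sketch: the highest code point a UTF-16 surrogate pair can express.
# Each surrogate contributes 10 payload bits; pairs are offset by 0x10000.
max_pair = 0x10000 + ((0x3FF << 10) | 0x3FF)
assert max_pair == 0x10FFFF
print(hex(max_pair))  # 0x10ffff
```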

Upvotes: 11
