Reputation: 7470
Why are there no 5-byte or 6-byte UTF-8 sequences? I know they existed until 2003, when they were removed, but I cannot find out why they were removed.
The Wikipedia page on UTF-8 says
In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
but I don't understand why it's important.
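For reference, the two percentages in that quote can be reproduced from the bit-pattern limits of 3-byte and 4-byte sequences. This is only a sketch of the arithmetic; the variable names are mine and not taken from the RFC or from Wikipedia.

```python
# Code points reachable by each sequence length, from the bit patterns alone.
three_byte = 0xFFFF - 0x0800 + 1          # 63,488 code points in 3-byte form
four_byte_old = 0x1FFFFF - 0x10000 + 1    # 2,031,616 before RFC 3629
four_byte_new = 0x10FFFF - 0x10000 + 1    # 1,048,576 after capping at U+10FFFF
surrogates = 0xDFFF - 0xD800 + 1          # 2,048 surrogate code points

surrogate_share = surrogates / three_byte
removed_four_byte_share = (four_byte_old - four_byte_new) / four_byte_old

print(f"{surrogate_share:.1%}")           # 3.2%
print(f"{removed_four_byte_share:.1%}")   # 48.4%
```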
Upvotes: 8
Views: 4170
Reputation: 2807
UPDATE:
To test whether a code point N lies within the surrogate range, instead of doing 2 relational comparisons, one could also do
floor( N / 2048 ) == 27
(this is identical to checking that floor( N / 4^5 ) is 54 or 55, with the terms rearranged).
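A quick way to convince yourself that the single division agrees with the two comparisons is to check both over the whole code space. A minimal sketch in Python; the function names are made up for illustration.

```python
def is_surrogate_by_range(cp: int) -> bool:
    # The usual test: two relational comparisons.
    return 0xD800 <= cp <= 0xDFFF


def is_surrogate_by_division(cp: int) -> bool:
    # The single-comparison test: 2048 == 2**11, so this is cp >> 11 == 27.
    return cp // 2048 == 27


# Spot checks at the edges of the surrogate block.
for cp in (0xD7FF, 0xD800, 0xDBFF, 0xDC00, 0xDFFF, 0xE000):
    assert is_surrogate_by_range(cp) == is_surrogate_by_division(cp)
```

Both tests select exactly the 2,048 values from 0xD800 to 0xDFFF, because that block starts on a multiple of 2048 and is exactly 2048 long.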
===========================================
Right now the space is allocated for 4^8 + 4^10 code points (CP), i.e. 1,114,112, but barely 1/4 to 1/3 of that is assigned to anything.
So unless there's a sudden need to add another 750k CPs in a very short time, up to 4 bytes for UTF-8 should be more than enough for years to come.
** Just a personal preference for 4^8 + 4^10: on top of clarity and simplicity, it also clearly delineates the CPs by UTF-8 byte count:
4 ^ 8 = 65,536 = all CPs for 1-, 2-, or 3-bytes UTF-8
4 ^ 10 = 1,048,576 = all CPs for 4-bytes UTF-8
instead of something unseemly like 2^16 * 17 or, worse, 32^4 + 16^4.
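If you want to double-check that the three formulas agree and that the 4^8 / 4^10 split really lines up with the UTF-8 byte count, something like this would do. It is just a sketch that leans on Python's built-in encoder; `utf8_len` is a helper name I made up.

```python
total = 4**8 + 4**10
# All three spellings give the same size for the code space.
assert total == 2**16 * 17 == 32**4 + 16**4 == 1_114_112


def utf8_len(cp: int) -> int:
    # Byte length of a code point in UTF-8 (surrogates cannot be encoded
    # by Python's strict encoder, so do not pass them in).
    return len(chr(cp).encode("utf-8"))


# The first 4^8 code points need at most 3 bytes ...
assert utf8_len(0x7F) == 1
assert utf8_len(0x7FF) == 2
assert utf8_len(4**8 - 1) == 3
# ... and the remaining 4^10 all need exactly 4 bytes.
assert utf8_len(4**8) == 4
assert utf8_len(total - 1) == 4
```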
*** Unrelated side note: the cleanest formula triplet I managed to conjure up for the starting points of the UTF-16 surrogates is:
4^5 * 54 = 55,296 = 0x D800 = High - surrogates
4^5 * 55 = 56,320 = 0x DC00 = Low - surrogates
4^5 * 56 = 57,344 = 0x E000 = just beyond the upper-boundary of 0x DFFF
Upvotes: 1
Reputation: 1695
I’ve heard some reasons, but didn’t find any of them convincing. Basically, the stupid reason is: UTF-16 was specified before UTF-8 and at that time, 20 bits of storage for characters (yielding 2²⁰+2¹⁶ characters, minus a few such as non-characters and the surrogates needed for management) were deemed enough.
UTF-8 and UTF-16 are already variable-length encodings that, as you said for UTF-8, could be extended without much hassle (use 5- and 6-byte words). Extending UTF-32 to include 21 to 31 bits is trivial (32 could be a problem due to signedness), but making it variable-length defeats the use-case of UTF-32 completely.
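To make the "5- and 6-byte words" concrete, here is a rough sketch of what an encoder for the old, pre-2003 layout (up to 31 bits) might look like. The function name `encode_utf8_legacy` is hypothetical, and the code does no surrogate or validity checks; it only illustrates the bit layout.

```python
def encode_utf8_legacy(cp: int) -> bytes:
    """Encode cp with the old 1- to 6-byte UTF-8 layout (illustration only)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point needs more than 31 bits")
    if cp < 0x80:
        return bytes([cp])  # 0xxxxxxx
    # (largest value, lead-byte marker, number of continuation bytes)
    layouts = (
        (0x7FF, 0xC0, 1),       # 110xxxxx
        (0xFFFF, 0xE0, 2),      # 1110xxxx
        (0x1FFFFF, 0xF0, 3),    # 11110xxx
        (0x3FFFFFF, 0xF8, 4),   # 111110xx (gone since RFC 3629)
        (0x7FFFFFFF, 0xFC, 5),  # 1111110x (gone since RFC 3629)
    )
    for limit, lead, n_cont in layouts:
        if cp <= limit:
            out = [lead | (cp >> (6 * n_cont))]
            # Each continuation byte carries 6 bits: 10xxxxxx.
            for i in range(n_cont - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable")


print(encode_utf8_legacy(0x20AC).hex())    # e282ac
print(encode_utf8_legacy(0x200000).hex())  # f888808080
```

For anything up to U+10FFFF (surrogates aside) this gives the same bytes as today's UTF-8; only values above that spill into the 5- and 6-byte forms.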
Extending UTF-16 is hard, but I’ll try. Look at what UTF-8 does in a 2-byte sequence: The initial 110yyyyy acts like a high surrogate and 10zzzzzz like a low surrogate. For UTF-16, flip it around and re-use high surrogates as “initial surrogates” and low surrogates as “continue surrogates”. So, basically, you can have multiple low surrogates.
There’s a problem, though: Unicode streams are supposed to resist misinterpretation when you’re “tuning in” or the sender is “tuning out”. In UTF-8, if the stream ends with 11100010 10000010, you know for sure the stream is incomplete. 1110 tells you: This is a 3-byte word, but one byte is still missing. In the suggested “extended UTF-16”, there’s nothing like that.
The “tuning out” can be solved by using U+10FFFE as an announcement for a single UTF-32 encoding. If the stream stops after U+10FFFE, you know you’re missing something, same goes for an incomplete UTF-32. And if it stops in the middle of the U+10FFFE, it’s lacking a low surrogate. But that does not work because “tuning in” to the UTF-32 encoding can mislead you.
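The UTF-8 half of that argument is easy to show in code. Here is a minimal sketch, assuming reasonably well-formed input, of how a receiver can tell from the tail of a byte stream that the sender “tuned out” mid-character. `missing_bytes_at_end` is a name I made up, and the function does not validate anything else about the stream.

```python
def missing_bytes_at_end(data: bytes) -> int:
    """Return how many continuation bytes the final sequence still lacks."""
    i = len(data) - 1
    seen = 0
    # Step back over up to three trailing continuation bytes (10xxxxxx).
    while i >= 0 and seen < 3 and (data[i] & 0xC0) == 0x80:
        seen += 1
        i -= 1
    if i < 0:
        return 0  # no lead byte in sight, nothing to measure against
    lead = data[i]
    if lead >= 0xF0:
        needed = 3  # 11110xxx announces a 4-byte word
    elif lead >= 0xE0:
        needed = 2  # 1110xxxx announces a 3-byte word
    elif lead >= 0xC0:
        needed = 1  # 110xxxxx announces a 2-byte word
    else:
        needed = 0  # ASCII, or a stray continuation byte
    return max(needed - seen, 0)


print(missing_bytes_at_end(bytes([0b11100010, 0b10000010])))  # 1
print(missing_bytes_at_end("\u20ac".encode("utf-8")))         # 0
```

The lead byte alone announces the length of the word, which is exactly the property the “extended UTF-16” idea lacks.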
What could be utilized are so-called non-characters (the most well-known would be the reverse of the byte order mark) at the end of plane 16: Encode U+10FFFE and U+10FFFF using existing surrogates to announce a 3- or 4-byte sequence, respectively. This is very wasteful: 32 bits are used for the announcement alone, 48 or 64 additional bits are used for the actual encoding. However, it is still better than, say, using U+10FFFE and U+10FFFF around a single UTF-32 encoding.
Maybe there’s something flawed in this reasoning. This is an argument of the sort: This is hard and I’ll prove it by trying and showing where the traps are.
Upvotes: 1
Reputation: 179779
Because there are no Unicode characters which would require them. And these cannot be added either because they'd be impossible to encode with UTF-16 surrogates.
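To illustrate the UTF-16 limit: a surrogate pair carries 10 + 10 bits on top of an offset of 0x10000, so the largest value it can express is U+10FFFF. A small sketch (the function name is my own):

```python
def from_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # 10 bits from each surrogate, offset past the 16-bit range.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(from_surrogates(0xD800, 0xDC00)))  # 0x10000
print(hex(from_surrogates(0xDBFF, 0xDFFF)))  # 0x10ffff
```

Anything that would need a 5- or 6-byte UTF-8 sequence lies above 0x10FFFF, so no surrogate pair can reach it.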
Upvotes: 11