Reputation: 86055
If a unicode code point uses 17 bits or more, how are the surrogate pairs calculated?
Upvotes: 5
Views: 5207
Reputation: 189317
Here is a hopefully somewhat more beginner-friendly exposition.
The surrogate code points are in the range 0xD800-0xDFFF. The first half of this space (0xD800-0xDBFF) is used for the high half of the surrogate pair and the second half (0xDC00-0xDFFF) for the low half.
So, to encode U+10000, you subtract 0x10000, split the 20-bit remainder into two 10-bit halves, and cram them into the slots available. Here the remainder is 0, so both halves are 0 and you get
D8 00 DC 00
Similarly, to encode U+10FFFF, the remainder is 0xFFFFF, both halves are 0x3FF, and you get
DB FF DF FF
In other words, the values from D800 to DBFF have their D800 part masked off, and the remainder is used for the high ten bits of the value we want to encode. Similarly, the values from DC00 to DFFF have DC00 masked off, and the remainder is used for the low ten bits of the encoded value.
By definition, the base for all these code points is 0x10000, so that does not have to be explicitly encoded, just the offset from this base.
U+00010000 = base 0x00010000 + 0x00000
           = 0000 0000 0000 0000 0000
             mmnn nnnn nnpp qqqq qqqq
U+0010FFFF = base 0x00010000 + 0xFFFFF
           = 1111 1111 1111 1111 1111
             mmnn nnnn nnpp qqqq qqqq
... where mmnnnnnnnn in hex is xxx and ppqqqqqqqq is yyy
1101 10mm nnnn nnnn  D800+xxx    1101 11pp qqqq qqqq  DC00+yyy
-----------------------------    -----------------------------
1101 1000 0000 0000  D800        1101 1100 0000 0000  DC00
1101 1011 1111 1111  DBFF        1101 1111 1111 1111  DFFF
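The bit layout above can be sketched in a few lines of C++. This is only an illustration, not part of the original answer: the names `SurrogatePair` and `to_surrogates` are made up here, and the function assumes the caller has already checked that the code point lies in U+10000..U+10FFFF.

```cpp
#include <cstdint>

// Hypothetical helper type: one UTF-16 surrogate pair.
struct SurrogatePair {
    std::uint16_t high;  // lands in 0xD800..0xDBFF
    std::uint16_t low;   // lands in 0xDC00..0xDFFF
};

// Split a supplementary code point into its surrogate pair.
// Assumption: 0x10000 <= cp <= 0x10FFFF (not checked here).
constexpr SurrogatePair to_surrogates(std::uint32_t cp) {
    return SurrogatePair{
        // High ten bits of the 20-bit offset, placed above the D800 base.
        static_cast<std::uint16_t>(0xD800u + (((cp - 0x10000u) >> 10) & 0x3FFu)),
        // Low ten bits of the offset, placed above the DC00 base.
        static_cast<std::uint16_t>(0xDC00u + ((cp - 0x10000u) & 0x3FFu))
    };
}
```

For example, U+1F600 has the offset 0xF600, whose high ten bits are 0x03D and low ten bits are 0x200, so `to_surrogates(0x1F600)` yields D83D DE00.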
Upvotes: 1
Reputation: 18507
If it is code you are after, here is how a single codepoint is encoded in UTF-16 and UTF-8 respectively.
A single codepoint to UTF-16 codeunits:
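// Assumes cp is a uint32_t code point and out is an output iterator
// (or pointer) accepting uint16_t code units.
if (cp < 0x10000u)
{
    // Basic Multilingual Plane: one code unit, stored as-is.
    *out++ = static_cast<uint16_t>(cp);
}
else
{
    // Supplementary plane: subtract the 0x10000 base, then split the
    // 20-bit offset into a high and a low ten-bit half.
    *out++ = static_cast<uint16_t>(0xd800u + (((cp - 0x10000u) >> 10) & 0x3ffu));
    *out++ = static_cast<uint16_t>(0xdc00u + ((cp - 0x10000u) & 0x3ffu));
}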
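Going the other way is the same arithmetic in reverse. The following is a minimal sketch, not taken from the answer above; `is_high_surrogate`, `is_low_surrogate` and `from_surrogates` are hypothetical names, and `from_surrogates` assumes it is handed a valid high/low pair.

```cpp
#include <cstdint>

// True if u is in 0xD800..0xDBFF (top six bits are 1101 10).
constexpr bool is_high_surrogate(std::uint16_t u) {
    return (u & 0xFC00u) == 0xD800u;
}

// True if u is in 0xDC00..0xDFFF (top six bits are 1101 11).
constexpr bool is_low_surrogate(std::uint16_t u) {
    return (u & 0xFC00u) == 0xDC00u;
}

// Recombine a surrogate pair into a code point.
// Assumption: is_high_surrogate(high) && is_low_surrogate(low).
constexpr std::uint32_t from_surrogates(std::uint16_t high, std::uint16_t low) {
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

A decoder would typically test each code unit with the two predicates first and treat an unpaired surrogate as an error; how to report that error is left open here.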
A single codepoint to UTF-8 codeunits:
// Assumes cp is a uint32_t code point and out is an output iterator
// (or pointer) accepting uint8_t code units.
if (cp < 0x80u)
{
    // 1 byte: 0xxxxxxx
    *out++ = static_cast<uint8_t>(cp);
}
else if (cp < 0x800u)
{
    // 2 bytes: 110xxxxx 10xxxxxx
    *out++ = static_cast<uint8_t>(((cp >> 6) & 0x1fu) | 0xc0u);
    *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
else if (cp < 0x10000u)
{
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    *out++ = static_cast<uint8_t>(((cp >> 12) & 0x0fu) | 0xe0u);
    *out++ = static_cast<uint8_t>(((cp >> 6) & 0x3fu) | 0x80u);
    *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
else
{
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    *out++ = static_cast<uint8_t>(((cp >> 18) & 0x07u) | 0xf0u);
    *out++ = static_cast<uint8_t>(((cp >> 12) & 0x3fu) | 0x80u);
    *out++ = static_cast<uint8_t>(((cp >> 6) & 0x3fu) | 0x80u);
    *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
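To try the UTF-8 branches on concrete values, the fragment can be wrapped in a function. The sketch below is one possible wrapping, not the answerer's code: the `Utf8` struct and `encode_utf8` name are invented for illustration, it needs C++14 or later for the relaxed `constexpr` rules, and it assumes the input is a valid code point (at most 0x10FFFF, not a surrogate).

```cpp
#include <cstdint>

// Hypothetical result type: up to four UTF-8 code units plus a count.
struct Utf8 {
    std::uint8_t bytes[4];
    int length;
};

// Encode one code point as UTF-8.
// Assumption: cp <= 0x10FFFF and cp is not a surrogate (not checked).
constexpr Utf8 encode_utf8(std::uint32_t cp) {
    Utf8 r{{0, 0, 0, 0}, 0};
    if (cp < 0x80u) {
        r.bytes[0] = static_cast<std::uint8_t>(cp);
        r.length = 1;
    } else if (cp < 0x800u) {
        r.bytes[0] = static_cast<std::uint8_t>(((cp >> 6) & 0x1Fu) | 0xC0u);
        r.bytes[1] = static_cast<std::uint8_t>((cp & 0x3Fu) | 0x80u);
        r.length = 2;
    } else if (cp < 0x10000u) {
        r.bytes[0] = static_cast<std::uint8_t>(((cp >> 12) & 0x0Fu) | 0xE0u);
        r.bytes[1] = static_cast<std::uint8_t>(((cp >> 6) & 0x3Fu) | 0x80u);
        r.bytes[2] = static_cast<std::uint8_t>((cp & 0x3Fu) | 0x80u);
        r.length = 3;
    } else {
        r.bytes[0] = static_cast<std::uint8_t>(((cp >> 18) & 0x07u) | 0xF0u);
        r.bytes[1] = static_cast<std::uint8_t>(((cp >> 12) & 0x3Fu) | 0x80u);
        r.bytes[2] = static_cast<std::uint8_t>(((cp >> 6) & 0x3Fu) | 0x80u);
        r.bytes[3] = static_cast<std::uint8_t>((cp & 0x3Fu) | 0x80u);
        r.length = 4;
    }
    return r;
}
```

With this, `encode_utf8(0x20AC)` (the euro sign) gives the three bytes E2 82 AC, and `encode_utf8(0x1F600)` gives the four bytes F0 9F 98 80.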
Upvotes: 7
Reputation: 11395
Unicode code points are scalar values which range from 0x000000 to 0x10FFFF. Thus they are 21-bit integers, not 17-bit.
Surrogate pairs are a mechanism of the UTF-16 form. This represents the 21-bit scalar values as one or two 16-bit code units.
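As a quick check of the bit counts, here is a small sketch. The helper names `bits_needed` and `utf16_length` are hypothetical, and `utf16_length` assumes its argument is a valid code point no larger than 0x10FFFF.

```cpp
#include <cstdint>

// Number of bits needed to represent v (0 for v == 0).
constexpr int bits_needed(std::uint32_t v) {
    return v == 0 ? 0 : 1 + bits_needed(v >> 1);
}

// Number of 16-bit code units UTF-16 uses for a code point.
// Assumption: cp <= 0x10FFFF and cp is not itself a surrogate.
constexpr int utf16_length(std::uint32_t cp) {
    return cp < 0x10000u ? 1 : 2;
}
```

Here `bits_needed(0x10FFFF)` is 21, while `bits_needed(0x10000)` is 17, which is the first value that no longer fits in a single 16-bit code unit and therefore needs a surrogate pair.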
This is explained in detail, with sample code, in the Unicode Consortium's FAQ "UTF-8, UTF-16, UTF-32 & BOM". That FAQ refers to the section of the Unicode Standard which has even more detail.
Upvotes: 9