user496949

Reputation: 86055

How are surrogate pairs calculated?

If a Unicode code point needs 17 bits or more, how is its surrogate pair calculated?

Upvotes: 5

Views: 5207

Answers (3)

tripleee

Reputation: 189317

Here is a hopefully somewhat more beginner-friendly exposition.

The surrogate code points are in the range 0xD800-0xDFFF. The first half of this range (0xD800-0xDBFF) is used for the high half of the surrogate pair and the second half (0xDC00-0xDFFF) for the low half.

So, to encode U+10000, you subtract 0x10000, split the remaining 20-bit offset into two ten-bit halves, and cram them into the slots available.

D8 00 DC 00

Similarly, to encode U+10FFFF, you get

DB FF DF FF

In other words, the values from D800 to DBFF have their D800 part masked off, and the remainder is used for the high ten bits of the value we want to encode. Similarly, the values from DC00 to DFFF have DC00 masked off, and the remainder is used for the low ten bits of the encoded value.

By definition, the base for all these code points is 0x10000, so that does not have to be explicitly encoded, just the offset from this base.
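As a minimal sketch of the encoding direction described above (the struct and function names are made up for illustration, and the input is assumed to already lie in 0x10000-0x10FFFF, with no validation):

```cpp
#include <cstdint>

// Hypothetical helper type for this sketch: the two code units of one pair.
struct SurrogatePair {
    std::uint16_t high;  // lands in 0xD800..0xDBFF
    std::uint16_t low;   // lands in 0xDC00..0xDFFF
};

// Assumes 0x10000 <= cp <= 0x10FFFF; nothing is checked here.
inline SurrogatePair to_surrogates(std::uint32_t cp) {
    const std::uint32_t offset = cp - 0x10000u;  // 20-bit offset from the base
    SurrogatePair pair;
    pair.high = static_cast<std::uint16_t>(0xD800u + (offset >> 10));    // high ten bits
    pair.low  = static_cast<std::uint16_t>(0xDC00u + (offset & 0x3FFu)); // low ten bits
    return pair;
}
```

Calling to_surrogates(0x10000) should give D800 DC00 and to_surrogates(0x10FFFF) should give DBFF DFFF, matching the two examples above.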

U+00010000 = base 0x00010000 + 0x00000
= 0000 0000 0000 0000 0000
  mmnn nnnn nnpp qqqq qqqq

U+0010FFFF = base 0x00010000 + 0xFFFFF
= 1111 1111 1111 1111 1111
  mmnn nnnn nnpp qqqq qqqq

... where mmnnnnnnnn in hex is xxx and ppqqqqqqqq is yyy

1101 10mm  nnnn nnnn   D800+xxx   1101 11pp  qqqq qqqq   DC00+yyy
-------------------------------   -------------------------------
1101 1000  0000 0000   D800       1101 1100  0000 0000   DC00
1101 1011  1111 1111   DBFF       1101 1111  1111 1111   DFFF
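Reading the layout backwards gives the decoding step. Here is a rough sketch, where from_surrogates is a made-up name and both inputs are assumed to already be a valid high/low pair:

```cpp
#include <cstdint>

// Assumes high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF.
inline std::uint32_t from_surrogates(std::uint16_t high, std::uint16_t low) {
    const std::uint32_t xxx = high & 0x3FFu;  // mmnnnnnnnn: strip the 1101 10 prefix
    const std::uint32_t yyy = low & 0x3FFu;   // ppqqqqqqqq: strip the 1101 11 prefix
    return 0x10000u + ((xxx << 10) | yyy);    // re-add the implicit base
}
```

For instance, the pair D83D DE00 has xxx = 0x03D and yyy = 0x200, so it decodes to 0x10000 + 0xF600 = U+1F600.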

Upvotes: 1

dalle

Reputation: 18507

If it is code you are after, here is how a single codepoint is encoded in UTF-16 and UTF-8 respectively.

A single codepoint to UTF-16 codeunits:

if (cp < 0x10000u)
{
   *out++ = static_cast<uint16_t>(cp);
}
else
{
   *out++ = static_cast<uint16_t>(0xd800u + (((cp - 0x10000u) >> 10) & 0x3ffu));
   *out++ = static_cast<uint16_t>(0xdc00u + ((cp - 0x10000u) & 0x3ffu));
}
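The fragment above takes cp and out as given and does not check its input. If you want something callable, one possible wrapper looks like this; append_utf16 and the choice of std::u16string as the output are my assumptions for the sketch, not part of the original snippet:

```cpp
#include <cstdint>
#include <string>

// Hypothetical wrapper: appends cp to out as UTF-16 and returns true, or
// returns false (leaving out untouched) if cp is not a Unicode scalar value.
inline bool append_utf16(std::u16string& out, std::uint32_t cp) {
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu)) {
        return false;  // beyond U+10FFFF, or a surrogate code point
    }
    if (cp < 0x10000u) {
        out.push_back(static_cast<char16_t>(cp));  // one code unit
    } else {
        const std::uint32_t offset = cp - 0x10000u;
        out.push_back(static_cast<char16_t>(0xD800u + (offset >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00u + (offset & 0x3FFu)));
    }
    return true;
}
```

Rejecting 0xD800-0xDFFF matters because those values are reserved for the surrogates themselves; passed through unchanged they would produce an unpaired surrogate in the output.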

A single codepoint to UTF-8 codeunits:

if (cp < 0x80u)
{
   *out++ = static_cast<uint8_t>(cp);
}
else if (cp < 0x800u)
{
   *out++ = static_cast<uint8_t>(((cp >> 6) & 0x1fu) | 0xc0u);
   *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
else if (cp < 0x10000u)
{
   *out++ = static_cast<uint8_t>(((cp >> 12) & 0x0fu) | 0xe0u);
   *out++ = static_cast<uint8_t>(((cp >> 6) & 0x3fu) | 0x80u);
   *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
else
{
   *out++ = static_cast<uint8_t>(((cp >> 18) & 0x07u) | 0xf0u);
   *out++ = static_cast<uint8_t>(((cp >> 12) & 0x3fu) | 0x80u);
   *out++ = static_cast<uint8_t>(((cp >> 6) & 0x3fu) | 0x80u);
   *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}

Upvotes: 7

Jim DeLaHunt

Reputation: 11395

Unicode code points are scalar values which range from 0x000000 to 0x10FFFF. Thus they are 21-bit integers, not 17-bit.

Surrogate pairs are a mechanism of the UTF-16 form. This represents the 21-bit scalar values as one or two 16-bit code units.

  • Scalar values from 0x000000 to 0x00FFFF are represented as a single 16-bit code unit, from 0x0000 to 0xFFFF.
  • Values from 0x00D800 to 0x00DFFF are reserved for the surrogates themselves. They are not characters in Unicode, and so they will never occur as characters in a well-formed Unicode string.
  • Scalar values from 0x010000 to 0x10FFFF are represented as two 16-bit code units. The first code unit encodes the upper 11 bits of the scalar value, as a code unit ranging from 0xD800-0xDBFF. The trick is that the top five of those bits only take the values 0x01-0x10, so subtracting 1 from them (equivalently, subtracting 0x10000 from the scalar value) leaves 0x0-0xF, which fits in four bits. The second code unit encodes the lower 10 bits of the scalar value, as a code unit ranging from 0xDC00-0xDFFF.
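
The three cases above can be sketched in code roughly as follows. All function names here are invented for illustration, and high_surrogate assumes its argument is already in 0x010000-0x10FFFF:

```cpp
#include <cstdint>

// Number of 16-bit code units UTF-16 uses for a scalar value.
inline int utf16_length(std::uint32_t cp) { return cp < 0x10000u ? 1 : 2; }

// A code unit is a surrogate when its top six bits are 110110 or 110111.
inline bool is_high_surrogate(std::uint16_t u) { return (u & 0xFC00u) == 0xD800u; }
inline bool is_low_surrogate(std::uint16_t u)  { return (u & 0xFC00u) == 0xDC00u; }

// First code unit, built from the upper 11 bits of the scalar value.
// Assumes 0x10000 <= cp <= 0x10FFFF.
inline std::uint16_t high_surrogate(std::uint32_t cp) {
    const std::uint32_t top_minus_one = (cp >> 16) - 1u;  // 0x01..0x10 becomes 0x0..0xF
    const std::uint32_t next_six = (cp >> 10) & 0x3Fu;    // the remaining six bits
    return static_cast<std::uint16_t>(0xD800u | (top_minus_one << 6) | next_six);
}
```

This bit-field view yields the same first code unit as subtracting 0x10000 and shifting right by ten; it just makes the four-bit squeeze visible.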

This is explained in detail, with sample code, in the Unicode consortium's FAQ, UTF-8, UTF-16, UTF-32 & BOM. That FAQ refers to the section of the Unicode Standard which has even more detail.

Upvotes: 9
