Ilya Loskutov

Reputation: 2201

How does UTF-16 encoding use surrogate code points?

According to the Unicode specification:

D91 UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair.

The term "scalar value" refers to Unicode code points, that is, the range of abstract ideas that different encoding forms (UTF-16 and so on) must encode into specific byte sequences. So the gist of this excerpt seems to be that not all code points fit into one UTF-16 code unit (two bytes); some must be encoded as a pair of code units, 4 bytes (this is called "a surrogate pair").

However, the very term "scalar value" is defined as follows:

D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

Wait... Does Unicode have surrogate code points? What is the reason for them when UTF-16 can use 4 bytes to represent scalar values? Can anyone explain the rationale and how UTF-16 uses these code points?

Upvotes: 4

Views: 1344

Answers (2)

Ilya Loskutov

Reputation: 2201

Just for the sake of ultimate clarification.

UTF-16 uses 16-bit (2-byte) code units. That is, this encoding form encodes code points (= abstract ideas that must be represented in computer memory in some way), as a rule, into 16 bits (so an interpreter presumably reads the data two bytes at a time).

UTF-16 does this in the most straightforward way it can: the U+000E code point is encoded as 0x000E, U+000F as 0x000F, and so on.

The issue is that 16 bits are not enough to accommodate all Unicode code points (the 0x0000 - 0xFFFF range allows only 65 536 possible values). We might use two 16-bit words (4 bytes) for the code points beyond this boundary (actually, my misunderstanding was about why UTF-16 doesn't do so). However, this approach makes some values impossible to decode unambiguously. For example, if we encode the U+10000 code point as 0x00010000, how on earth should the interpreter decode it: as two successive code points, U+0001 and U+0000, or as a single one, U+10000?
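To make the ambiguity concrete, here is a small sketch in Python (my choice of language for illustration; the "naive" scheme is hypothetical and not part of any spec). It feeds the bytes 0x0001 0x0000 to a real UTF-16 decoder, which can only read them as two separate code points:

```python
# Hypothetical naive scheme: write U+10000 as the two 16-bit words 0x0001 0x0000.
naive_bytes = bytes.fromhex("00010000")

# A UTF-16 decoder cannot tell that these two words were meant to belong
# together, so it reads them as two independent code points.
decoded = naive_bytes.decode("utf-16-be")
code_points = [f"U+{ord(ch):04X}" for ch in decoded]

print(code_points)  # ['U+0001', 'U+0000']
```

The information that the two words formed one code point is lost, which is why some 16-bit values have to be reserved as markers.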

The Unicode specification chose a better way. To encode the U+10000 - U+10FFFF range (1 048 576 code points, by the way), we first set apart 1 024 + 1 024 = 2 048 values from the 16-bit code unit range (the spec chose 0xD800 - 0xDFFF for this purpose). When the interpreter encounters a value in 0xD800 - 0xDBFF, it knows that this code unit does not encode a "full-fledged" code point on its own (no scalar value, in the spec's terms). It must then read another 16 bits, which hold a value from the 0xDC00 - 0xDFFF range, and work out which of the U+10000 - U+10FFFF code points these 4 bytes encode. Note that this scheme can encode 1 024 * 1 024 = 1 048 576 code points (exactly the number we need).
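The arithmetic behind this can be sketched in a few lines of Python. The function names `encode_utf16` and `decode_utf16` are my own, made up for illustration; they work on lists of 16-bit integers rather than bytes and follow the standard surrogate-pair formula (subtract 0x10000, split the remaining 20 bits into two 10-bit halves):

```python
def encode_utf16(cp):
    """Return the list of 16-bit code units for one Unicode scalar value."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        return [cp]                      # fits into a single code unit
    v = cp - 0x10000                     # 20-bit value, 0x00000..0xFFFFF
    high = 0xD800 | (v >> 10)            # top 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 | (v & 0x3FF)           # bottom 10 bits -> 0xDC00..0xDFFF
    return [high, low]


def decode_utf16(units):
    """Return the list of scalar values encoded by a list of code units."""
    result = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:        # high surrogate: need one more unit
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("high surrogate without a low surrogate")
            low = units[i + 1]
            result.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:      # low surrogate with no high before it
            raise ValueError("unexpected low surrogate")
        else:
            result.append(u)
            i += 1
    return result


print([hex(u) for u in encode_utf16(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in encode_utf16(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

The first and last code points of the supplementary range map to the first and last possible surrogate pairs, which is another way to see that 1 024 * 1 024 pairs cover the range exactly.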

The Unicode spec simply includes this integer range in its codespace, which spans the code points from U+0000 to U+10FFFF. Within it, U+D800 - U+DBFF is called the High Surrogate Area and U+DC00 - U+DFFF the Low Surrogate Area. Since the values of these code points match those of the UTF-16 code units (= the specific representations of code points), this inclusion can be viewed as a UTF-16 relic:

The high-surrogate and low-surrogate code points are designated for surrogate code units in the UTF-16 character encoding form. They are unassigned to any abstract character. [spec]
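As a quick check of the difference between "code point" and "scalar value", here is a Python sketch. The helper `is_scalar_value` is a name I made up; it just restates definition D76. Python strings can hold a lone surrogate code point, but the built-in UTF-16 codec refuses to encode it:

```python
def is_scalar_value(cp):
    """D76: any code point except the surrogate code points."""
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


# 0x110000 code points in total, minus the 2 048 surrogates.
scalar_count = 0x110000 - (0xDFFF - 0xD800 + 1)

# A lone surrogate is a code point, but no encoding form may encode it.
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-16-be")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(is_scalar_value(0xD800), encodable, scalar_count)  # False False 1112064
```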

Upvotes: 2

Mark Tolonen

Reputation: 178409

Yes, Unicode reserves ranges for surrogate code points: U+D800..U+DBFF for high surrogates and U+DC00..U+DFFF for low surrogates.

Unicode reserves these ranges because these 16-bit values are used in surrogate pairs and no other symbol can be assigned to them. Surrogate pairs are two 16-bit values that encode code points above U+FFFF that do not fit into a single 16-bit value.
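This is easy to observe with any UTF-16 encoder. A minimal Python example (the sample characters are my own choice), using the built-in `utf-16-be` codec so that no byte order mark is added:

```python
bmp_char = "A"                   # U+0041, fits in one 16-bit code unit
astral_char = "\U0001F600"       # U+1F600, above U+FFFF

bmp_hex = bmp_char.encode("utf-16-be").hex()
astral_hex = astral_char.encode("utf-16-be").hex()

print(bmp_hex)     # 0041
print(astral_hex)  # d83dde00
```

The second result is the surrogate pair 0xD83D 0xDE00: the first unit falls in the high-surrogate range and the second in the low-surrogate range.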

Upvotes: 3
