Ben Aston

Reputation: 55729

How can I construct a UTF-16 character in JavaScript from the surrogate pair?

The following calculates the UTF-16 surrogate pair for a Unicode code point (U+1F637, Face with Medical Mask).

But how can I then construct the character from the surrogate pair, for use in a string?

const codepoint = 0b11111011000110111 // 😷 (U+1F637)
const tmp = codepoint - 0x10000
const padded = tmp.toString(2).padStart(20, '0')
// high and low surrogate values, converted to hex strings
const unit1 = (Number.parseInt(padded.substr(0, 10), 2) + 0xD800).toString(16)
const unit2 = (Number.parseInt(padded.substr(10), 2) + 0xDC00).toString(16)

// obviously hard-coding the values works...
console.log(`Hard-coded: \ud83d\ude37`)
// ...but how to combine unit1 and unit2 to print the character?
console.log(`Dynamic: unit1: ${unit1}, unit2: ${unit2}`)

Upvotes: 2

Views: 587

Answers (1)

T.J. Crowder

Reputation: 1074008

Two answers for you:

You may not have to

In a modern JavaScript environment you don't have to split the code point apart; you can use String.fromCodePoint to create the character directly:

const ch = String.fromCodePoint(codepoint);

Live Example:

const codepoint = 0b11111011000110111; // 😷
const ch = String.fromCodePoint(codepoint);

console.log(ch);

You can build it from parts

If you don't have fromCodePoint, or you have the surrogates as your starting point, you can get the string version of each surrogate via fromCharCode. But don't call toString(16) on them; leave the units as numbers:

const unit1 = Number.parseInt(padded.substr(0, 10), 2) + 0xD800;
const unit2 = Number.parseInt(padded.substr(10), 2) + 0xDC00;

const ch = String.fromCharCode(unit1, unit2);

Live Example:

const codepoint = 0b11111011000110111; // 😷
const tmp = codepoint - 0x10000;
const padded = tmp.toString(2).padStart(20, '0');
const unit1 = Number.parseInt(padded.substr(0, 10), 2) + 0xD800;
const unit2 = Number.parseInt(padded.substr(10), 2) + 0xDC00;

const ch = String.fromCharCode(unit1, unit2);

console.log(ch);

You could even do it as

const ch = String.fromCharCode(unit1) + String.fromCharCode(unit2);

...but since fromCharCode accepts multiple char codes (code units), it probably makes more sense to pass both of them to it at once.

The fact that it works with each in isolation (String.fromCharCode(unit1) + String.fromCharCode(unit2)) may seem really, really weird. "You mean String.fromCharCode happily creates a string with just half of a surrogate pair?!" Yup. :-) JavaScript strings are sequences of UTF-16 code units, but they tolerate invalid or unpaired surrogates. From the spec:

The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”) up to a maximum length of 2^53 - 1 elements. The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value. ...

ECMAScript operations that do not interpret String contents apply no further semantics. Operations that do interpret String values treat each element as a single UTF-16 code unit. However, ECMAScript does not restrict the value of or relationships between these code units, so operations that further interpret String contents as sequences of Unicode code points encoded in UTF-16 must account for ill-formed subsequences. ...

(my emphasis)
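
For instance, a lone high surrogate is a perfectly legal one-element string as far as the language is concerned; environments that support ES2024's String.prototype.isWellFormed will report it as ill-formed:

const lone = String.fromCharCode(0xD83D); // high surrogate with no trailing low surrogate
console.log(lone.length); // 1 (a valid, if ill-formed, string)
// Where ES2024's String.prototype.isWellFormed is available:
console.log(lone.isWellFormed()); // false
console.log("\ud83d\ude37".isWellFormed()); // true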

Upvotes: 4
