Reputation: 55729
The following calculates the UTF-16 surrogate pair for a Unicode codepoint (Face with Medical Mask).
But how can I then construct the character from the surrogate pair, for use in a string?
const codepoint = 0b11111011000110111 // 😷
const tmp = codepoint - 0x10000
const padded = tmp.toString(2).padStart(20, '0')
const unit1 = (Number.parseInt(padded.substr(0, 10), 2) + 0xD800).toString(16)
const unit2 = (Number.parseInt(padded.substr(10), 2) + 0xDC00).toString(16)
// obviously hard-coding the values works...
console.log(`Hard-coded: \ud83d\ude37`)
// ...but how to combine unit1 and unit2 to print the character?
console.log(`Dynamic: unit1: ${unit1}, unit2: ${unit2}`)
Upvotes: 2
Views: 587
Reputation: 1074008
Two answers for you:
In a modern JavaScript environment you don't have to split the code point apart; you can use String.fromCodePoint to create the character directly:
const ch = String.fromCodePoint(codepoint);
Live Example:
const codepoint = 0b11111011000110111; // 😷
const ch = String.fromCodePoint(codepoint);
console.log(ch);
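If you want to sanity-check the result, codePointAt (also ES2015+) goes the other way and reads a full code point back out of the string. A quick sketch of that round trip, assuming an environment that has both methods:
const codepoint = 0b11111011000110111; // 😷 (U+1F637)
const ch = String.fromCodePoint(codepoint);
// codePointAt(0) reads the whole surrogate pair as one code point
console.log(ch.codePointAt(0) === codepoint); // true
console.log(ch.length); // 2 (still two UTF-16 code units)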
If you don't have fromCodePoint, or you have the surrogates as your starting point, you can get the string version of each surrogate via fromCharCode — but don't do toString(16); leave the units as numbers:
const unit1 = Number.parseInt(padded.substr(0, 10), 2) + 0xD800;
const unit2 = Number.parseInt(padded.substr(10), 2) + 0xDC00;
const ch = String.fromCharCode(unit1, unit2);
Live Example:
const codepoint = 0b11111011000110111; // 😷
const tmp = codepoint - 0x10000;
const padded = tmp.toString(2).padStart(20, '0');
const unit1 = Number.parseInt(padded.substr(0, 10), 2) + 0xD800;
const unit2 = Number.parseInt(padded.substr(10), 2) + 0xDC00;
const ch = String.fromCharCode(unit1, unit2);
console.log(ch);
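As an aside, the same split can be done with bit operations instead of a padded binary string: the high surrogate takes the top 10 bits of codepoint - 0x10000 and the low surrogate the bottom 10. A sketch of that variant (equivalent to the code above, just a different way to get unit1 and unit2):
const codepoint = 0b11111011000110111; // 😷
const tmp = codepoint - 0x10000;      // 20-bit value
const unit1 = 0xD800 + (tmp >> 10);   // high surrogate: top 10 bits
const unit2 = 0xDC00 + (tmp & 0x3FF); // low surrogate: bottom 10 bits
console.log(String.fromCharCode(unit1, unit2)); // 😷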
You could even do it as
const ch = String.fromCharCode(unit1) + String.fromCharCode(unit2);
...but since fromCharCode accepts multiple char codes (code units), it probably makes more sense to pass both of them to it at once.
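For instance, a small sketch (using an illustrative units array) showing both forms side by side:
const units = [0xD83D, 0xDE37]; // surrogate pair for 😷
console.log(String.fromCharCode(...units));      // one call with both code units
console.log(String.fromCharCode(units[0]) + String.fromCharCode(units[1])); // same result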
The fact it works with each in isolation (String.fromCharCode(unit1) + String.fromCharCode(unit2)) may seem really, really weird. "You mean String.fromCharCode happily creates a string with just half of a surrogate pair?!" Yup. :-) JavaScript strings are sequences of UTF-16 code units, but they tolerate unpaired (orphaned) surrogates. From the spec:
The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”) up to a maximum length of 2^53 - 1 elements. The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value. ...
ECMAScript operations that do not interpret String contents apply no further semantics. Operations that do interpret String values treat each element as a single UTF-16 code unit. However, ECMAScript does not restrict the value of or relationships between these code units, so operations that further interpret String contents as sequences of Unicode code points encoded in UTF-16 must account for ill-formed subsequences. ...
(my emphasis)
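To see that tolerance in action, here's a small sketch with a lone high surrogate (an ill-formed string, but still a perfectly legal value):
const half = String.fromCharCode(0xD83D); // just the high surrogate
console.log(half.length);                 // 1 (a valid string, even though it's ill-formed UTF-16)
console.log(half + String.fromCharCode(0xDE37)); // 😷 once the low surrogate is appended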
Upvotes: 4