Audun Olsen

Reputation: 628

Unicode surrogate pairs and String.fromCodePoint() — JavaScript

I'm dealing with raw strings containing escape sequences for the surrogate halves of UTF-16 astral symbols. (I think I got that lingo right…)

console.log("\uD83D\uDCA9")
// => 💩

Let's use the above emoji as an example. If I have the surrogate pair (\uD83D\uDCA9), how can I take its hexadecimal values and turn them into a valid argument for JavaScript's String.fromCodePoint() function?

I've tried the following:

const codePoint = ["D83D", "DCA9"].reduce((acc, cur) => {
    return acc += parseInt(cur, 16);
}, 0);

console.log(String.fromCodePoint(codePoint));
// => 𛓦 (some weird symbol appears, not 💩!)

PS: I'm familiar with ES6 code point escape sequences, which put the hexadecimal value between curly braces {…} instead of using surrogate halves, as shown below. But I need to do this with surrogate pairs!
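For reference, this is the ES6 escape style I mean (0x1F4A9 is the combined code point of the pair above):

console.log("\u{1F4A9}")
// => 💩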

Any suggestions are greatly appreciated.

Upvotes: 4

Views: 1835

Answers (2)

Pointy

Reputation: 413702

You can pass a list of values to the function:

console.log(String.fromCodePoint(0xd83d, 0xdca9));

Thus a "valid argument" for String.fromCodePoint() is not necessarily a single value, and indeed for a character that requires a surrogate pair it by definition cannot be a single value. Why? Because each individual numeric source value, as far as String.fromCodePoint() is concerned, must be a 16-bit (2-byte) value. If you could pass bigger single numbers, there would be no need for surrogate pairs!

Edit: much of the above paragraph is inaccurate; the .fromCodePoint() method will accept full Unicode code point values (greater than 16 bits). Of course it still has to split them into surrogate pairs internally, because JavaScript strings are UTF-16, but it means that if you happen to have full-size Unicode code points you don't have to split them up yourself, which is nice. However, if you already have the pairs, there's really no point in combining them yourself, because the method also works on the pairs when they're passed as part of a list of points.
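For example, passing the full code point directly works (0x1F4A9 is the combined value of the pair in the question):

console.log(String.fromCodePoint(0x1F4A9));
// => 💩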

If you have values in an array, you can invoke the function with apply:

var points = [0xd83d, 0xdca9];
console.log(String.fromCodePoint.apply(String, points));
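In an ES6 environment you could also spread the array instead of using apply; the result is the same, so this is purely a stylistic choice:

const points = [0xd83d, 0xdca9];
console.log(String.fromCodePoint(...points));
// => 💩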

Upvotes: 4

Mr Lister

Reputation: 46539

The solution by Pointy is correct, but to answer your question about what goes wrong with your formula: you simply add 0xD83D and 0xDCA9, which gives 0x1B4E6. That is not how surrogates work; you should have used the proper formula

((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000

which can be shortened to

((first - 0xD7F7) << 10) + second

See Unicode encodings.

If you do that, you'll get 0x1F4A9.

const codePoint = ["D83D", "DCA9"].reduce((acc, cur) => {
  cur = parseInt(cur, 16);
  // high surrogate (< 0xDC00): shifted part of the formula above; low surrogate: added as-is
  return acc + (cur < 0xDC00 ? (cur - 0xD7F7) << 10 : cur);
}, 0);

console.log(String.fromCodePoint(codePoint));
// => now outputs 💩!
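If you need this in more than one place, the unshortened formula can be wrapped in a small helper (a sketch; the function name is just for illustration):

// Hypothetical helper: combines a high/low surrogate pair into a single code point
function surrogatePairToCodePoint(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

console.log(String.fromCodePoint(surrogatePairToCodePoint(0xD83D, 0xDCA9)));
// => 💩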

Upvotes: 2
