Reputation: 55729
Unicode characters can be identified by different numbers.
For example, the "Face with Medical Mask" emoji can be identified by the descriptor U+1F637 or U+E40C.
I presume these descriptors identify the index of the character in the complete table of Unicode characters: but why are there two of them?
In UTF-16 this Unicode code point can be represented as four bytes, forming two 16-bit code units (I think):
D83D followed by DE37
console.log('\uD83D\uDE37') // prints 😷
How can I get from the Unicode descriptor to the binary representation of the character and then to the UTF-16 representation?
Upvotes: 1
Views: 147
Reputation: 338730
You asked:
but why are there two of them?
There are not two. There is one officially designated character (U+1F637); the other (U+E40C) is a "private use" number that can be unofficially assigned by anyone to any character.
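Unicode code points run from U+0000 to U+10FFFF, a range of over a million numbers (1,114,112). A few blocks within that range are set aside as Private Use Areas.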
The numbers in the private areas can be used by any parties who agree upon their semantics. Any person can assign any character they want to any number in a private range. After making their own private agreement, those parties can safely exchange data using those code points knowing they will never suddenly be reinterpreted by future software as official characters.
Why would anybody do this? The parties might be academics researching and documenting some obscure language not yet recognized by the Unicode Consortium. Or they might be fans of a fictional language like Klingon that does not meet the requirements for official inclusion in Unicode. Or they might be people who want to invent a new emoji unofficially. In all these cases, the parties using the private areas need to implement a font with glyphs for their unofficial characters.
Some people outside the Unicode Consortium have coordinated efforts to publicly assign characters not covered by Unicode to various ranges within the Private Use Areas. They may publish a registry to make others aware. But such assignments are not official, of course, and compliance is optional.
Your number U+E40C (decimal 58,380) is from a Private Use Area number range. That character may have been commonly used by various people as the face mask emoji in the olden days. But that number was never assigned officially by the Unicode Consortium. Nor will it ever be assigned, because it is set aside for private use only.
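As a rough illustration, a range check is enough to tell whether a number falls in one of the three Private Use Areas. The helper name `isPrivateUse` is made up for this sketch; it is not a built-in.

```javascript
// Hypothetical helper (not part of JavaScript): reports whether a code point
// lies in one of the three Unicode Private Use Areas.
function isPrivateUse(codePoint) {
  return (
    (codePoint >= 0xE000 && codePoint <= 0xF8FF) ||    // Private Use Area (in the BMP)
    (codePoint >= 0xF0000 && codePoint <= 0xFFFFD) ||  // Supplementary Private Use Area-A
    (codePoint >= 0x100000 && codePoint <= 0x10FFFD)   // Supplementary Private Use Area-B
  );
}

console.log(isPrivateUse(0xE40C));  // true
console.log(isPrivateUse(0x1F637)); // false
```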
U+1F637 = FACE WITH MEDICAL MASK = 😷
U+1F637 (decimal 128,567) was officially designated by the Unicode Consortium in Unicode 6.0 in 2010 as FACE WITH MEDICAL MASK.
You asked:
How can I get from the Unicode descriptor to the binary representation of the character and then to the UTF-16 representation?
To encode this number, see the Answer by Ben.
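If you only need the result and not the arithmetic, JavaScript's built-in string functions can do the conversion for you. This is a minimal sketch; the variable names `ch` and `units` are only for illustration.

```javascript
// Build the string from the code point; JavaScript strings are UTF-16 internally.
const ch = String.fromCodePoint(0x1F637);

// Collect each 16-bit code unit as uppercase hex.
const units = [];
for (let i = 0; i < ch.length; i++) {
  units.push(ch.charCodeAt(i).toString(16).toUpperCase());
}

console.log(ch.length);                                    // 2
console.log(units.join(' '));                              // D83D DE37
console.log(ch.codePointAt(0).toString(16).toUpperCase()); // 1F637
```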
Upvotes: 1
Reputation: 55729
The character "Face with Medical Mask" 😷 is code point U+1F637.
In binary this is: 1 1111 0110 0011 0111.
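To encode this in UTF-16 you need to do the following:
0x10000 is subtracted from the code point, leaving a 20-bit number.
The high ten bits of that number are added to 0xD800 to give the first 16-bit code unit (the high surrogate).
The low ten bits are added to 0xDC00 to give the second 16-bit code unit (the low surrogate).
const codepoint = 0b11111011000110111 // 😷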
const tmp = codepoint - 0x10000
const padded = tmp.toString(2).padStart(20, '0')
const unit1 = Number.parseInt(padded.slice(0, 10), 2) + 0xD800; // high ten bits
const unit2 = Number.parseInt(padded.slice(10), 2) + 0xDC00; // low ten bits
const ch = String.fromCharCode(unit1) + String.fromCharCode(unit2);
console.log(ch);
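The same steps can also be sketched with bit operations instead of string padding, along with the reverse direction. The function names `toSurrogatePair` and `fromSurrogatePair` are invented for this example, and the sketch assumes the input is a code point above U+FFFF.

```javascript
// Hypothetical helper: split a supplementary code point into two UTF-16 code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('expected a code point in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // top ten bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom ten bits
  return [high, low];
}

// Hypothetical helper: recombine a surrogate pair into the original code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const [high, low] = toSurrogatePair(0x1F637);
console.log(high.toString(16), low.toString(16));       // d83d de37
console.log(fromSurrogatePair(high, low).toString(16)); // 1f637
```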
Upvotes: 1