Ben Aston

Reputation: 55729

Conversion between various numbers identifying a Unicode character

Unicode characters can be identified by different numbers.

For example, the "Face with Medical Mask" emoji can be identified by the descriptor U+1F637 or U+E40C.

I presume these descriptors identify the index of the character in the complete table of Unicode characters: but why are there two of them?

In UTF-16 this Unicode code point can be represented as four bytes, forming two 16-bit code units (I think):

D83D followed by DE37

console.log('\uD83D\uDE37') // prints 😷
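For reference, modern JavaScript can go the other way and confirm that this surrogate pair corresponds to a single code point:

```javascript
// A surrogate pair in a JS string decodes to one code point:
console.log('\uD83D\uDE37'.codePointAt(0).toString(16)); // "1f637"

// And a code point above U+FFFF encodes back to the pair:
console.log(String.fromCodePoint(0x1F637)); // 😷
```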

How can I get from the Unicode descriptor to the binary representation of the character and then to the UTF-16 representation?

Upvotes: 1

Views: 147

Answers (2)

Basil Bourque

Reputation: 338730

Private Use Area in Unicode

You asked:

but why are there two of them?

There are not two. There is one officially designated character (U+1F637); the other (U+E40C) is a "private use" number that can be unofficially assigned by anyone to any character.

Unicode code points use a range of over a million numbers.

  • Only about a tenth (over 113,000) of those have been assigned officially to a specific character.
  • Broad ranges are reserved for Private Use Areas (PUA). Like a nature preserve sets aside property with the intent that it never be developed, these private ranges of numbers will never be officially assigned to a character.
  • The rest of the million-plus numbers are simply unassigned, waiting to be officially assigned a character some day by the Unicode Consortium.

The numbers in the private areas can be used by any parties who agree upon their semantics. Any person can assign any character they want to any number in a private range. After making their own private agreement, those parties can safely exchange data using those code points knowing they will never suddenly be reinterpreted by future software as official characters.

Why would anybody do this? The parties might be academics researching and documenting some obscure language not yet recognized by the Unicode Consortium. Or they might be fans of a fictional language like Klingon that does not meet the requirements for official inclusion in Unicode. Or they might be people who want to invent a new emoji unofficially. In all these cases, the parties using the private areas need to implement a font with glyphs for their unofficial characters.

Some people outside the Unicode Consortium have coordinated efforts to publicly assign characters not covered by Unicode to various ranges within the Private Use Areas. They may publish a registry to make others aware. But such assignments are not official, of course, and compliance is optional.

Your number U+E40C (decimal 58,380) is from a Private Use Area number range. That character may have been commonly used by various people as the face mask emoji in the olden days. But that number was never assigned officially by the Unicode Consortium. Nor will it ever be assigned, because it is set aside for private use only.
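For reference, the Private Use Areas occupy three fixed ranges: U+E000–U+F8FF in the Basic Multilingual Plane, plus the whole of planes 15 and 16 (U+F0000–U+FFFFD and U+100000–U+10FFFD). A quick sketch of a range check in JavaScript:

```javascript
// Returns true if a code point lies in one of Unicode's Private Use Areas:
// U+E000–U+F8FF (BMP), U+F0000–U+FFFFD, or U+100000–U+10FFFD.
function isPrivateUse(cp) {
  return (cp >= 0xE000 && cp <= 0xF8FF) ||
         (cp >= 0xF0000 && cp <= 0xFFFFD) ||
         (cp >= 0x100000 && cp <= 0x10FFFD);
}

console.log(isPrivateUse(0xE40C));  // true  – the unofficial mask-emoji number
console.log(isPrivateUse(0x1F637)); // false – the official FACE WITH MEDICAL MASK
```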

U+1F637 = FACE WITH MEDICAL MASK = 😷

U+1F637 (decimal 128,567) was officially designated by the Unicode Consortium in Unicode 6.0 in 2010 as FACE WITH MEDICAL MASK.

Encoding

You asked:

How can I get from the Unicode descriptor to the binary representation of the character and then to the UTF-16 representation?

To encode this number, see the Answer by Ben.

Upvotes: 1

Ben Aston

Reputation: 55729

The character "Face with Medical Mask" 😷 is code point U+1F637.

In binary this is: 1 1111 0110 0011 0111.

To encode this in UTF-16 you need to do the following:

  1. 0x10000 is subtracted from the code point
  2. The high ten bits are added to 0xD800 to give the first 16-bit code unit
  3. The low ten bits are added to 0xDC00 to give the second 16-bit code unit

const codepoint = 0b11111011000110111 // U+1F637 😷

// Step 1: subtract 0x10000, leaving a 20-bit number
const tmp = codepoint - 0x10000
const padded = tmp.toString(2).padStart(20, '0')

// Steps 2 and 3: split into high and low ten bits and add the offsets
const unit1 = Number.parseInt(padded.slice(0, 10), 2) + 0xD800;
const unit2 = Number.parseInt(padded.slice(10), 2) + 0xDC00;

const ch = String.fromCharCode(unit1) + String.fromCharCode(unit2);

console.log(ch); // 😷
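The same steps run in reverse recover the code point from the two code units, and modern JavaScript also has both directions built in:

```javascript
// Decoding: reverse the three steps to recover the code point
// from the surrogate pair D83D DE37.
const hi = 0xD83D, lo = 0xDE37;
const cp = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
console.log(cp.toString(16)); // "1f637"

// Built-in equivalents (ES2015+):
console.log(String.fromCodePoint(0x1F637));        // 😷
console.log('😷'.codePointAt(0).toString(16));     // "1f637"
```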

Upvotes: 1
