Nathan Hinchey

Reputation: 1201

Python 2.7: Names of unicode representations

What are the names for these different kinds of ASCII representations of Unicode?

And is there a term for the set that they belong to that's more specific than "representation"? And in the context of these, how would I describe the non-ASCII representation (😢)?

Since I don't know what to call them it is very hard to search for how to work with them.

Thanks!

Upvotes: 1

Views: 205

Answers (2)

Tom Blodget

Reputation: 20802

For Python 3

First there seems to be a misunderstanding about the hex escapes:

print("\xF0\x9F\x98\xA2" == "\u00F0\u009F\u0098\u00A2")
print("\xF0\x9F\x98\xA2" == "\U000000F0\U0000009F\U00000098\U000000A2")
print("\xF0\x9F\x98\xA2" == "\N{LATIN SMALL LETTER ETH}\N{APPLICATION PROGRAM COMMAND}\N{START OF STRING}\N{CENT SIGN}")

and for completeness (I recall using octal effectively in machine code where some instructions had 3-bit, aligned arguments but I don't see the point in real programming):

print("\xF0\x9F\x98\xA2" == "\360\237\230\242")

In a str literal, they are all Unicode codepoint escapes, written in 2-digit, 4-digit, and 8-digit hexadecimal, with ranges from U+0000 to U+00FF, U+FFFF, and U+10FFFF, respectively.
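
One way to check those ranges is to look at the largest codepoint each escape form can spell. This is a minimal sketch for Python 3; the variable names are only illustrative.

```python
# Largest codepoint each escape form can express in a Python 3 str literal.
two_digit = "\xff"            # \xNN: U+0000 to U+00FF
four_digit = "\uffff"         # \uNNNN: U+0000 to U+FFFF
eight_digit = "\U0010ffff"    # \UNNNNNNNN: U+0000 to U+10FFFF

for label, ch in [("\\x", two_digit), ("\\u", four_digit), ("\\U", eight_digit)]:
    # Each literal is a single codepoint, whatever the escape length.
    print(label, len(ch), "U+%04X" % ord(ch))
```

Each literal has length 1, so the escape length only limits how large a codepoint can be written, not what kind of value it is.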

We can confirm that, unlike in other languages where \u denotes a UTF-16 code unit, in Python 3 it really denotes a codepoint.

print("\ud83d\ude22" == "\U0000d83d\U0000de22")

and for completeness:

print("\U0001f622" == "😢")
print("\N{CRYING FACE}" == "😢")

In other languages (where they would be two UTF-16 code units), "\ud83d\ude22" would equal "😢".

Now, U+D83D and U+DE22 are Unicode codepoints designated as surrogates. In other words, they are not characters. They reserve the part of the codepoint codespace whose values match the UTF-16 surrogate code units. This is how the UCS-2 encoding of Unicode was transparently extended to UTF-16 when Unicode was expanded from 2^16 codepoints to the current codespace of U+0000 to U+10FFFF (which needs 21 bits). For more information see the Unicode FAQ.
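
To see the surrogate relationship from inside Python 3, one can encode the real character to UTF-16 and compare the resulting code units with the two surrogate codepoints. The following is a sketch using only the standard library. The `surrogatepass` round trip is there as an illustration of the relationship, not as a recommended way to build strings.

```python
import unicodedata

crying = "\U0001f622"        # one codepoint: U+1F622
pair = "\ud83d\ude22"        # two codepoints, both surrogates

# UTF-16 stores U+1F622 as two 16-bit code units; print them as hex.
print(crying.encode("utf-16-be").hex())

# Look up the general category of each codepoint in the pair.
print([unicodedata.category(c) for c in pair])

# Surrogates are not characters, so strict UTF-8 encoding rejects them.
try:
    pair.encode("utf-8")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)

# Forcing the surrogates through UTF-16 reinterprets them as a pair.
repaired = pair.encode("utf-16-be", "surrogatepass").decode("utf-16-be")
print(repaired == crying)
```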


As @Robᵩ points out, you can have a bytestring literal, too:

print("\U0001f622".encode("utf-8") == b"\xF0\x9F\x98\xA2")
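
The same four escapes mean different things depending on the literal type: four bytes in a bytes literal, four codepoints in a str literal. A short sketch of the difference, assuming Python 3 (the variable names are illustrative):

```python
as_bytes = b"\xF0\x9F\x98\xA2"   # four bytes
as_text = "\xF0\x9F\x98\xA2"     # four codepoints: U+00F0 U+009F U+0098 U+00A2

# Decoding the bytes as UTF-8 yields the single character U+1F622.
print(as_bytes.decode("utf-8") == "\U0001f622")

# The str version is a different string of length 4.
print(len(as_text), as_text == "\U0001f622")

# Latin-1 maps codepoints U+0000..U+00FF to the same byte values, so it
# can convert the mistaken str back into the intended bytes.
print(as_text.encode("latin-1") == as_bytes)
```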

Upvotes: 1

Mantas Kandratavičius

Reputation: 1508

As Tom Blodget already warned you, this is a somewhat Python-specific answer.


The leading \ shows that it's an escape sequence.

\x means that the next two characters will be interpreted as a two-digit hex value.

\U means that the next eight characters will be interpreted as a 32-bit hex value.

You can read more about that here:

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
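
If the goal is to convert between the literal character and its escaped spelling, the standard library can do that directly. A sketch, assuming Python 3; `unicode_escape` is a built-in codec and `ascii()` is a built-in function, while the variable names are illustrative:

```python
text = "\U0001f622"

# Encode to the ASCII-only escaped spelling that Python source would use.
escaped = text.encode("unicode_escape")
print(escaped)

# ascii() gives the same escaped form, wrapped in quotes like repr().
print(ascii(text))

# Decoding the escaped bytes recovers the original character.
restored = escaped.decode("unicode_escape")
print(restored == text)
```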

To fully answer your question:

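  • \xF0\x9F\x98\xA2 is four 2-digit hex escapes. In a str literal they are four codepoints (U+00F0, U+009F, U+0098, U+00A2), none of which is ASCII; in a bytes literal they are the four bytes of the UTF-8 encoding of U+1F622.
  • \U0001f622 is a single Unicode codepoint written as an 8-digit (32-bit) hex value.
  • 😢 is the character itself, written literally; what you see on screen is its glyph.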
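
One way to check what each spelling actually contains is to list its codepoints and their Unicode names. This is a sketch for Python 3; `describe` is a hypothetical helper written for this example, not a library function.

```python
import unicodedata

def describe(s):
    """Return (codepoint, name) pairs; codepoints without a name get '<unnamed>'."""
    return [("U+%04X" % ord(c), unicodedata.name(c, "<unnamed>")) for c in s]

print(describe("\xF0\x9F\x98\xA2"))   # str literal with four \x escapes
print(describe("\U0001f622"))         # one 8-digit escape
print(describe("😢"))                 # the literal character
```

The last two calls return the same single entry, while the first returns four entries, two of which are control codes that `unicodedata.name` has no name for.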

Upvotes: 1
