Reputation: 1201
What are the names for these different kinds of ASCII representations of Unicode?
\xF0\x9F\x98\xA2
\U0001f622
And is there a term for the set that they belong to that's more specific than "representation"? And in the context of these, how would I describe the non-ASCII representation (😢)?
Since I don't know what to call them, it is very hard to search for how to work with them.
Thanks!
Upvotes: 1
Views: 205
Reputation: 20802
For Python 3
First there seems to be a misunderstanding about the hex escapes:
print("\xF0\x9F\x98\xA2" == "\u00F0\u009F\u0098\u00A2")
print("\xF0\x9F\x98\xA2" == "\U000000F0\U0000009F\U00000098\U000000A2")
print("\xF0\x9F\x98\xA2" == "\N{LATIN SMALL LETTER ETH}\N{APPLICATION PROGRAM COMMAND}\N{START OF STRING}\N{CENT SIGN}")
and for completeness (I recall using octal effectively in machine code where some instructions had 3-bit, aligned arguments but I don't see the point in real programming):
print("\xF0\x9F\x98\xA2" == "\360\237\230\242")
It appears they are all Unicode codepoint escapes: \x takes 2 hexadecimal digits, \u takes 4, and \U takes 8, so they can express codepoints from U+0000 up to U+00FF, U+FFFF, and U+10FFFF, respectively.
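As a quick check of those ranges, here is a minimal sketch that writes the largest value each escape form can express and compares it with chr(). The variable names are only illustrative.

```python
# Largest codepoint each escape form can express in a Python 3 str literal.
max_x = "\xff"          # 2 hex digits -> U+00FF
max_u = "\uffff"        # 4 hex digits -> U+FFFF
max_U = "\U0010ffff"    # 8 hex digits -> U+10FFFF (top of the Unicode codespace)

print(max_x == chr(0xFF))        # True
print(max_u == chr(0xFFFF))      # True
print(max_U == chr(0x10FFFF))    # True

# Each literal is one codepoint, however many digits were typed.
print(len(max_x), len(max_u), len(max_U))   # 1 1 1
```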
We can confirm that, unlike in other languages where \u denotes a UTF-16 code unit, in Python 3 it really denotes a codepoint.
print("\ud83d\ude22" == "\U0000d83d\U0000de22")
and for completeness:
print("\U0001f622" == "😢")
print("\N{CRYING FACE}" == "😢")
In other languages (where they would be two UTF-16 code units), "\ud83d\ude22" would equal "😢".
Now, U+D83D and U+DE22 are Unicode codepoints designated as surrogates. In other words, they are not characters. They reserve the part of the codespace whose values correspond to the UTF-16 code units used in surrogate pairs. This is the way the UCS-2 encoding of Unicode was transparently extended to UTF-16 when Unicode was expanded from 2^16 codepoints to a 21-bit range (U+0000 to U+10FFFF). For more information see the Unicode FAQ.
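To make the relationship concrete, here is a rough sketch of the standard UTF-16 surrogate arithmetic. The helper name to_surrogate_pair is made up for this illustration; it is not a standard library function.

```python
def to_surrogate_pair(codepoint):
    """Return the (high, low) UTF-16 surrogates for a codepoint above U+FFFF."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF need a surrogate pair")
    offset = codepoint - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(ord("😢"))
print(hex(high), hex(low))             # 0xd83d 0xde22

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
print("😢".encode("utf-16-be") == expected)   # True
```

So the pair in "\ud83d\ude22" is what a UTF-16 encoder produces for U+1F622, which is why languages with UTF-16 strings treat the two escapes as one character.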
As @Robᵩ points out, you can have a bytestring literal, too:
print("\U0001f622".encode("utf-8") == b"\xF0\x9F\x98\xA2")
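In practice, a str holding the four codepoints from "\xF0\x9F\x98\xA2" is often UTF-8 data that was decoded as Latin-1 somewhere along the way. Assuming that is what happened, the following sketch shows one way to get back to the intended character; the variable names are illustrative only.

```python
emoji = "\U0001f622"
utf8_bytes = emoji.encode("utf-8")
print(utf8_bytes)                            # b'\xf0\x9f\x98\xa2'
print(utf8_bytes.decode("utf-8") == emoji)   # True

# The str literal with the same escapes is four separate codepoints.
four_codepoints = "\xF0\x9F\x98\xA2"
print(len(four_codepoints), len(emoji))      # 4 1

# Latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF, so encoding
# with it recovers the UTF-8 bytes, which can then be decoded properly.
repaired = four_codepoints.encode("latin-1").decode("utf-8")
print(repaired == emoji)                     # True
```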
Upvotes: 1
Reputation: 1508
As Tom Blodget already warned you, this is a somewhat Python-specific answer.
The leading \ shows that it's an escape sequence.
\x means that the next two characters will be interpreted as hex digits, giving one 8-bit value.
\U means that the next eight characters will be interpreted as a 32-bit hex value.
You can read more about that here:
https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
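If the escape sequence arrives as plain text (for example, read from a file) rather than typed in source code, Python's "unicode_escape" codec can interpret it. This is a small sketch under that assumption, not the only way to do it.

```python
import unicodedata

escaped_text = r"\U0001f622"     # raw string: ten plain ASCII characters
print(len(escaped_text))         # 10

# The "unicode_escape" codec interprets Python-style escapes found in bytes.
decoded = escaped_text.encode("ascii").decode("unicode_escape")
print(decoded == "😢")           # True

# Going the other way yields the ASCII-only escaped form again.
print("😢".encode("unicode_escape"))   # b'\\U0001f622'

# The codepoint number and its official name are available too.
print(hex(ord(decoded)), unicodedata.name(decoded))   # 0x1f622 CRYING FACE
```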
To fully answer your question:
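\xF0\x9F\x98\xA2 is four characters, each given by its two-digit hex value. They are not ASCII, since all four values are above 0x7F; in a bytes literal they are the UTF-8 encoding of 😢.
\U0001f622 is a Unicode codepoint written as a 32-bit hex value.
😢 is a glyph or simply a special character.
Upvotes: 1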