Reputation: 772
I am trying to convert Escaped Unicode into Emojis.
Example:
>>> emoji = "😀"
>>> emoji_text = "\\ud83d\\ude00"
>>> print(emoji)
😀
>>> print(emoji_text)
\ud83d\ude00
Instead of "\ud83d\ude00", I would like to print 😀.
I found a simple trick that works but is not practical:
>>> import json
>>> json.loads('"\\ud83d\\ude00"')
'😀'
Upvotes: 3
Views: 5024
Reputation: 177406
Your example is close to JSON's ensure_ascii=True string output, except that it is missing the double quotes around the string. It contains Unicode-escaped high/low surrogates for a Unicode character above U+FFFF.
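Because the text already has the shape of a JSON string body, one way to reuse the json module is to add the missing double quotes before parsing. The following is only a minimal sketch: decode_json_escapes is a hypothetical helper name, and it assumes the input contains no raw double quotes, no control characters, and no backslashes other than valid JSON escapes.

```python
import json

def decode_json_escapes(text):
    # Hypothetical helper: wrap the text in double quotes so it parses as a
    # single JSON string; the JSON parser joins surrogate pairs itself.
    return json.loads('"' + text + '"')

result = decode_json_escapes("\\ud83d\\ude00")
print(ascii(result))  # '\U0001f600'
```

Printing with ascii() keeps the example independent of the terminal's encoding; print(result) shows the emoji itself where the terminal supports it.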
Note that the unicode-escape codec can't be used alone for the conversion. It creates a Unicode string containing two separate surrogate code points, which is not well-formed text. You won't be able to print the string or encode it for serialization.
>>> s = "\\ud83d\\ude00"
>>> s = s.encode('ascii').decode('unicode-escape')
>>> s
'\ud83d\ude00'
>>> print(s) # UnicodeEncodeError: surrogates not allowed
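As a small check of the claim above, the sketch below confirms that the decoded string holds two separate surrogates and that a strict encoder rejects it. The variable name encodable is only illustrative.

```python
s = "\\ud83d\\ude00".encode('ascii').decode('unicode-escape')
print(len(s))  # 2: two separate surrogate code points, not one character

try:
    s.encode('utf-8')
    encodable = True
except UnicodeEncodeError:
    # The strict UTF-8 encoder refuses surrogate code points.
    encodable = False

print(encodable)  # False
```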
Using the surrogatepass error handler with the utf-16 codec, you can undo the surrogates and decode the string properly. Note that this will decode non-surrogate escape codes as well:
>>> s = "Hello\\u9a6c\\u514b\\ud83d\\ude00"
>>> s.encode('ascii').decode('unicode-escape').encode('utf-16', 'surrogatepass').decode('utf-16')
'Hello马克😀'
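The one-liner above could be wrapped in a small function for reuse. This is a sketch under assumptions: decode_escapes is a hypothetical name, the input must be pure ASCII (otherwise encode('ascii') raises UnicodeEncodeError), and other backslash sequences such as \n are interpreted by the unicode-escape codec too.

```python
def decode_escapes(text):
    # Step 1: interpret the \uXXXX escapes; surrogate pairs stay split in two.
    with_surrogates = text.encode('ascii').decode('unicode-escape')
    # Step 2: write the surrogates out as raw UTF-16 code units, then decode
    # normally so each high/low pair is joined into one character.
    return with_surrogates.encode('utf-16', 'surrogatepass').decode('utf-16')

decoded = decode_escapes("Hello\\u9a6c\\u514b\\ud83d\\ude00")
print(ascii(decoded))
```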
The following code will replace Unicode surrogates with their Unicode code point. If you have other non-surrogate Unicode escapes, it will replace them with their code points as well.
import re
def process(m):
    '''process(m) -> Unicode character

    m is a regular expression match object that has groups below:
     1: high Unicode surrogate 4-digit hex code d800-dbff
     2: low Unicode surrogate 4-digit hex code dc00-dfff
     3: None
    OR
     1: None
     2: None
     3: Unicode 4-digit hex code 0000-d7ff,e000-ffff
    '''
    if m.group(3) is None:
        # Construct code point from UTF-16 surrogates
        hi = int(m.group(1),16) & 0x3FF
        lo = int(m.group(2),16) & 0x3FF
        # Add (not OR) the 0x10000 offset: hi << 10 can itself set bit 16,
        # e.g. for U+20000 and above.
        cp = 0x10000 + ((hi << 10) | lo)
    else:
        cp = int(m.group(3),16)
    return chr(cp)

s = "Hello\\u9a6c\\u514b\\ud83d\\ude00"
s = re.sub(r'\\u(d[89ab][0-9a-f]{2})\\u(d[cdef][0-9a-f]{2})|\\u([0-9a-f]{4})',process,s)
print(s)
Output:
Hello马克😀
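The surrogate arithmetic can also be checked in isolation. The sketch below uses a hypothetical helper, combine_surrogates, that takes the two surrogate values as integers. The 0x10000 offset is added rather than combined with a bitwise OR, because the shifted high bits can overlap bit 16 (for example, for U+20000).

```python
def combine_surrogates(hi, lo):
    # Hypothetical helper: hi is 0xD800-0xDBFF, lo is 0xDC00-0xDFFF.
    # Each surrogate carries 10 payload bits; together they encode
    # (code point - 0x10000), so the offset is added back at the end.
    return 0x10000 + (((hi & 0x3FF) << 10) | (lo & 0x3FF))

print(hex(combine_surrogates(0xD83D, 0xDE00)))  # 0x1f600
print(hex(combine_surrogates(0xD840, 0xDC00)))  # 0x20000
```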
Upvotes: 5