Escaped Unicode to Emoji in Python

Question

I am trying to convert escaped Unicode into emoji.

Example:

>>> emoji = "😀"
>>> emoji_text = "\\ud83d\\ude00"
>>> print(emoji)
😀
>>> print(emoji_text)
\ud83d\ude00

Instead of the literal text \ud83d\ude00, I would like to print 😀.

I found a simple trick that works but is not practical:

>>> import json
>>> json.loads('"\\ud83d\\ude00"')
'😀'

Mark Tolonen · Accepted Answer

Your example is close to what JSON produces with ensure_ascii=True, except that a JSON string must also be wrapped in double quotes. It contains the escaped high and low surrogates of a Unicode character above U+FFFF.
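The JSON connection can be sketched as a small helper. The name unescape_via_json is made up for illustration, and the sketch assumes the text contains no bare double quotes, no raw control characters, and only backslash escapes that are valid in JSON:

```python
import json

def unescape_via_json(text: str) -> str:
    """Decode \\uXXXX escapes by treating the text as the body of a JSON string.

    Hypothetical helper. Assumes `text` has no unescaped double quotes,
    no raw control characters, and no backslash escapes JSON would reject.
    """
    # Wrapping in double quotes turns the text into a JSON string literal;
    # the JSON decoder combines surrogate pairs into a single character.
    return json.loads('"' + text + '"')

# The escaped surrogate pair becomes the single code point U+1F600.
print(unescape_via_json("\\ud83d\\ude00") == "\U0001F600")  # True
```

Input that breaks those assumptions makes json.loads raise json.JSONDecodeError, which is part of why the trick is fragile.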

Note that the unicode-escape codec can't be used alone for the conversion. It produces a Unicode string containing surrogate code points, which are not valid in well-formed text. Python lets such a string exist, but you won't be able to print it or encode it for serialization.

>>> s = "\\ud83d\\ude00"
>>> s = s.encode('ascii').decode('unicode-escape')
>>> s
'\ud83d\ude00'
>>> print(s)  # UnicodeEncodeError: surrogates not allowed
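To see the problem without relying on the terminal's encoding, a small sketch like the following inspects the decoded string and tries to encode it as UTF-8 (the variable names are only illustrative):

```python
escaped = "\\ud83d\\ude00"

# unicode-escape turns each \uXXXX into one code point, so the result
# holds two separate surrogate code points rather than one emoji.
decoded = escaped.encode("ascii").decode("unicode-escape")
print(len(decoded))                      # 2
print([hex(ord(ch)) for ch in decoded])  # ['0xd83d', '0xde00']

# Strict UTF-8 refuses to encode surrogate code points.
error_reason = None
try:
    decoded.encode("utf-8")
except UnicodeEncodeError as exc:
    error_reason = exc.reason
print(error_reason)
```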

Using the surrogatepass error handler with the utf-16 codec, you can write the surrogates out as UTF-16 code units and decode them back as a single character. Note that this will decode non-surrogate escape codes as well:

>>> s = "Hello\\u9a6c\\u514b\\ud83d\\ude00"
>>> s.encode('ascii').decode('unicode-escape').encode('utf-16', 'surrogatepass').decode('utf-16')
'Hello马克😀'
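Wrapped up as a function, the round trip might look like the sketch below. The name unescape_unicode is hypothetical, and using the backslashreplace error handler is an addition of mine (not part of the one-liner above) so that input that already contains non-ASCII characters does not fail at the ASCII step:

```python
def unescape_unicode(text: str) -> str:
    """Decode \\uXXXX escapes, joining surrogate pairs into real characters.

    Hypothetical helper. Like the one-liner it is based on, it also
    interprets other backslash escapes such as \\n or \\x41.
    """
    # backslashreplace is an assumption added here: it rewrites any non-ASCII
    # character as an escape so the ASCII encode cannot fail.
    raw = text.encode("ascii", "backslashreplace")
    # Produces a str that may contain separate high/low surrogates.
    with_surrogates = raw.decode("unicode-escape")
    # surrogatepass writes the surrogates as UTF-16 code units; decoding
    # UTF-16 then reads each pair as one code point.
    return with_surrogates.encode("utf-16", "surrogatepass").decode("utf-16")

print(unescape_unicode("\\ud83d\\ude00") == "\U0001F600")  # True
```

An unpaired surrogate escape (for example a lone \ud83d) makes the final UTF-16 decode raise UnicodeDecodeError, so callers may want to catch that.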

Older solution:

The following code replaces each escaped surrogate pair with the single character it encodes. Any other non-surrogate Unicode escapes are replaced with their characters as well.

import re

def process(m):
    '''process(m) -> Unicode code point

    m is a regular expression match object that has groups below:
     1: high Unicode surrogate 4-digit hex code d800-dbff
     2: low  Unicode surrogate 4-digit hex code dc00-dfff
     3: None
    OR
     1: None
     2: None
     3: Unicode 4-digit hex code 0000-d7ff,e000-ffff
    '''
    if m.group(3) is None:
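        # Construct code point from UTF-16 surrogates.
        # Each surrogate carries 10 bits of (code point - 0x10000).
        hi = int(m.group(1), 16) & 0x3FF
        lo = int(m.group(2), 16) & 0x3FF
        # Add the offset: OR-ing it would lose bit 16 when hi >= 0x40.
        cp = 0x10000 + ((hi << 10) | lo)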
    else:
        cp = int(m.group(3),16)
    return chr(cp)

s = "Hello\\u9a6c\\u514b\\ud83d\\ude00"
s = re.sub(r'\\u(d[89ab][0-9a-f]{2})\\u(d[cdef][0-9a-f]{2})|\\u([0-9a-f]{4})', process, s)
print(s)

Output:

Hello马克😀
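The surrogate arithmetic can be checked by hand with a short sketch. The helper name combine_surrogates is made up for illustration; it adds the 0x10000 offset rather than OR-ing it in, because the shifted high bits can already occupy bit 16:

```python
def combine_surrogates(hi: int, lo: int) -> int:
    """Return the code point encoded by a UTF-16 surrogate pair (illustrative helper)."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    # Each surrogate carries 10 payload bits; together they hold (code point - 0x10000).
    return 0x10000 + (((hi & 0x3FF) << 10) | (lo & 0x3FF))

print(hex(combine_surrogates(0xD83D, 0xDE00)))  # 0x1f600
print(hex(combine_surrogates(0xD840, 0xDC00)))  # 0x20000
```

The second call is the case where bit 16 matters: the high surrogate's payload is 0x40, which shifts to 0x10000 on its own.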
