Bolster

Reputation: 7916

Converting From U+ unicode string definition to true unicode character

I have a long list of Unicode definitions and description mappings that use the 'U+1F49A' notation.

In Python 3, how can I read these in as true Unicode characters (i.e. '\U0001F49A', or 'πŸ’š')?

I've tried string slicing and composition, e.g. '\U000{}'.format('1F49A'), but end up with SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-4: truncated \UXXXXXXXX escape, because the string literal fails at parse time on a partial Unicode escape.
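That error happens because \U escapes are expanded while the literal is being parsed, before .format() ever runs, so a partial escape can never be completed at runtime. A minimal sketch reproducing it (illustrative only; eval is used just to force the parse and catch the error):

# The \U escape is expanded at parse time, so the partial escape in the
# literal '\U000{}' is rejected before .format() can ever run.
try:
    eval("'\\U000{}'.format('1F49A')")
except SyntaxError as exc:
    print(exc)   # (unicode error) ... truncated \UXXXXXXXX escape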

Upvotes: 6

Views: 2506

Answers (1)

Robα΅©

Reputation: 168716

You can also use int() to parse the number, and chr() to convert the number to a single-character string.

For example:

In [8]: chr(0x1f49a)
Out[8]: 'πŸ’š'

In [9]: s='U+1F49A'

In [10]: chr(int(s[2:], 16))
Out[10]: 'πŸ’š'
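Since the question mentions a long list of code/description mappings, the same chr()/int() conversion scales to the whole list. A rough sketch, assuming a tab-separated 'U+XXXX<TAB>description' format (that format is a guess):

# Hypothetical input: one 'U+XXXX<TAB>description' entry per line.
mappings = [
    'U+1F49A\tGREEN HEART',
    'U+2764\tHEAVY BLACK HEART',
]

for entry in mappings:
    code, description = entry.split('\t', 1)
    char = chr(int(code[2:], 16))   # drop the 'U+' prefix, parse the hex
    print(char, description)        # e.g. πŸ’š GREEN HEART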

If you want to convert all of the U+xxxx instances in a larger string, you can use the same chr()/int() pattern in the second argument of re.sub():

In [13]: import re

In [14]: s = 'U+1F49A -vs- U+2764'

In [15]: re.sub(r'U\+([0-9a-fA-F]+)', lambda m: chr(int(m.group(1),16)), s)
Out[15]: 'πŸ’š -vs- ❀'
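If this is needed in more than one place, the substitution can be wrapped in a small helper; just a sketch, the name expand_unicode_refs is mine:

import re

def expand_unicode_refs(text):
    """Replace every 'U+XXXX' reference in text with the character it names."""
    return re.sub(r'U\+([0-9a-fA-F]+)',
                  lambda m: chr(int(m.group(1), 16)),
                  text)

print(expand_unicode_refs('U+1F49A -vs- U+2764'))   # πŸ’š -vs- ❀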

Upvotes: 13
