Reputation: 169
I am working on an emoji detection module. For some emojis I am observing odd behavior: after encoding them to UTF-8 and decoding again, they are not converted back to their original representation. I need to send their exact colored representation in the API response instead of a Unicode-escaped string. Any leads?
In [1]: x = "example1: 🤭 and example2: 😁 and example3: 🥺"
In [2]: x.encode('utf8')
Out[2]: b'example1: \xf0\x9f\xa4\xad and example2: \xf0\x9f\x98\x81 and example3: \xf0\x9f\xa5\xba'
In [3]: x.encode('utf8').decode('utf8')
Out[3]: 'example1: \U0001f92d and example2: 😁 and example3: \U0001f97a'
In [4]: print( x.encode('utf8').decode('utf8') )
example1: 🤭 and example2: 😁 and example3: 🥺
Update 1: This example should make the problem clearer. Here, two emojis are rendered when I send the Unicode-escaped string, but the third example failed to convert to the exact emoji. What should I do in such a case?
Upvotes: 0
Views: 1010
Reputation: 177406
'\U0001f92d' == '🤭' is True. It is an escape code, but it is still the same character; these are just two ways of displaying or entering it. The former is the repr() of the string, while printing calls str(). Example:
>>> s = '🤭'
>>> print(repr(s))
'\U0001f92d'
>>> print(str(s))
🤭
>>> s
'\U0001f92d'
>>> print(s)
🤭
When Python generates the repr(), it uses an escape code if it considers the character non-printable, for example because its Unicode database does not define that code point. The content of the string is still the same: the Unicode code point.
Escaping in the repr() is a debugging feature. For example, is the white space made of spaces or tabs? The repr() of the string makes it clear by using \t as an escape code.
>>> s = 'a\tb'
>>> print(s)
a       b
>>> s
'a\tb'
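Since the escape is only how repr() displays the string, the string itself can be sent unchanged. The following is a minimal sketch, assuming the API response is serialized as JSON with the standard json module (the question does not say which format or framework is used, so the variable names and the response step are hypothetical). By default json.dumps writes non-ASCII characters as \uXXXX escapes, which is still valid JSON; ensure_ascii=False keeps the characters themselves.

```python
import json

# Two of the emoji from the question, written as escapes so the source
# file stays ASCII-only. The string content is the real characters.
text = "example1: \U0001f92d and example3: \U0001f97a"

# A UTF-8 round trip returns exactly the same string.
assert text.encode("utf8").decode("utf8") == text

# Default: non-ASCII characters become \uXXXX escapes (surrogate pairs
# for characters outside the Basic Multilingual Plane).
escaped = json.dumps({"text": text})

# ensure_ascii=False keeps the characters themselves in the output.
literal = json.dumps({"text": text}, ensure_ascii=False)

# Encode to UTF-8 bytes for the response body (hypothetical API layer).
body = literal.encode("utf8")

# Both forms decode to the same data on the receiving side.
assert json.loads(escaped) == json.loads(literal)
```

Either form gives the client the same string after parsing. Whether the emoji then appears in color depends on the font and renderer on the client side, not on Python.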
As to why an escape code is used for one emoji and not another, it depends on the version of Unicode supported by the version of Python used.
Python 3.6 uses Unicode 9.0, and two of your emoji (U+1F92D and U+1F97A) aren't defined at that version level:
>>> import unicodedata as ud
>>> ud.unidata_version
'9.0.0'
>>> ud.name('😁')
'GRINNING FACE WITH SMILING EYES'
>>> ud.name('🤭')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
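To see which characters a given interpreter does not know about, one option is to check the general category reported by unicodedata. This is a rough sketch, not part of the original answer; the helper name unassigned_chars is made up for illustration. Code points that the interpreter's Unicode database has no entry for get the category "Cn", and those are characters that repr() escapes because str.isprintable() returns False for them.

```python
import unicodedata


def unassigned_chars(text):
    """Return the characters that this interpreter's Unicode database
    reports as unassigned (general category 'Cn')."""
    return [ch for ch in text if unicodedata.category(ch) == "Cn"]


sample = "example1: \U0001f92d and example2: \U0001f601 and example3: \U0001f97a"

print("Unicode database version:", unicodedata.unidata_version)
for ch in unassigned_chars(sample):
    # ascii() always shows the escape, whatever the Python version.
    print("not defined here:", ascii(ch))
```

On an interpreter with an older Unicode database this lists the emoji that show up escaped in the repr(); on a newer one the list is empty and the repr() shows the emoji directly.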
Upvotes: 3