Reputation: 101
I have a collection of tweets and I want to check the emojis in them, but it looks like the procedure that wrote the collection converted every emoji into a text emoticon: for example, '😊' is stored as ':-)', another emoji as ':D', and so on for all emojis. Comparing the encoded bytes shows that the two forms differ: ':-)'.encode('utf-8') equals b':-)', while '😊'.encode('utf-8') equals b'\xf0\x9f\x98\x8a', so an equality check fails. The same holds for UTF-16: ':-)'.encode('utf-16') equals b'\xff\xfe:\x00-\x00)\x00' and '😊'.encode('utf-16') is b'\xff\xfe=\xd8\n\xde'. So is there any way to convert text representations such as ':-)' back to the emoji '😊'?
Upvotes: 3
Views: 4378
Reputation: 30103
Use a dictionary to convert any text emoticon back to emoji e.g. as follows:
>>> dict_emo = { ':-)' : b'\xf0\x9f\x98\x8a',
... ':)' : b'\xf0\x9f\x98\x8a',
... '=)' : b'\xf0\x9f\x98\x8a', # Smile or happy
... ':-D' : b'\xf0\x9f\x98\x83',
... ':D' : b'\xf0\x9f\x98\x83',
... '=D' : b'\xf0\x9f\x98\x83', # Big smile
... '>:-(' : b'\xF0\x9F\x98\xA0',
... '>:-o' : b'\xF0\x9F\x98\xA0' # Angry face
... }
>>> print( dict_emo[':)'].decode('utf-8'))
😊
>>> print( dict_emo['>:-('].decode('utf-8'))
😠
>>> print( dict_emo[':-D'].decode('utf-8'))
😃
>>>
>>>
>>> dict_emot= { ':-)' : b'\xf0\x9f\x98\x8a'.decode('utf-8'),
... ':)' : b'\xf0\x9f\x98\x8a'.decode('utf-8'),
... '=)' : b'\xf0\x9f\x98\x8a'.decode('utf-8'), # Smile or happy
... ':-D' : b'\xf0\x9f\x98\x83'.decode('utf-8'),
... ':D' : b'\xf0\x9f\x98\x83'.decode('utf-8'),
... '=D' : b'\xf0\x9f\x98\x83'.decode('utf-8'), # Big smile
... '>:-(' : b'\xF0\x9F\x98\xA0'.decode('utf-8'),
... '>:-o' : b'\xF0\x9F\x98\xA0'.decode('utf-8') # Angry face
... }
>>> print( dict_emot[':)'] )
😊
>>> print( dict_emot['>:-o'] )
😠
>>> print( dict_emot['=D'] )
😃
>>>
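The lookups above convert one emoticon at a time. To replace emoticons inside a whole tweet, one possible approach is to build a regular expression from the dictionary keys and substitute each match. This is only a minimal sketch: the names EMOTICON_TO_EMOJI and emoticons_to_emoji are made up for illustration, and the table is just the small subset used above.

```python
import re

# Emoticon-to-emoji table, same idea as dict_emot above (illustrative subset).
EMOTICON_TO_EMOJI = {
    ':-)': '\U0001F60A', ':)': '\U0001F60A', '=)': '\U0001F60A',   # smile or happy
    ':-D': '\U0001F603', ':D': '\U0001F603', '=D': '\U0001F603',   # big smile
    '>:-(': '\U0001F620', '>:-o': '\U0001F620',                    # angry face
}

# Put longer emoticons first in the alternation so that a short key
# is never matched where a longer key starts at the same position.
_keys = sorted(EMOTICON_TO_EMOJI, key=len, reverse=True)
_pattern = re.compile('|'.join(re.escape(k) for k in _keys))

def emoticons_to_emoji(text):
    """Replace every known text emoticon in `text` with its emoji."""
    return _pattern.sub(lambda m: EMOTICON_TO_EMOJI[m.group(0)], text)

print(emoticons_to_emoji('nice day :-) but the train was late >:-('))
```

Note that this is plain substring matching, so a sequence such as ':D' is replaced wherever it appears, even in the middle of other text; stricter rules (for example requiring surrounding whitespace) would need a more elaborate pattern.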
Unfortunately, there are at least two tasks remaining:
for example, the ':-)' smile is contained in the ':-))' double chin, so the shorter emoticon can match inside the longer one.
Upvotes: 5