Reputation: 501
I am working with data from Google BigQuery that contains emoji. I have code that removes emoji, but it does not work for the specific emoji shown below.
Sample code that removes most emoji but fails in the second case:
Version used:
Python 3.9
from re import UNICODE, compile
emoji_pattern = compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U0001F1F2-\U0001F1F4" # Macau flag
u"\U0001F1E6-\U0001F1FF" # flags
u"\U0001F600-\U0001F64F"
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
u"\U0001f926-\U0001f937"
u"\U0001F1F2"
u"\U0001F1F4"
u"\U0001F620"
u"\u200d"
u"\u2640-\u2642"
"]+", flags=UNICODE)
# Works for this one
data = 'support.google.co.uk/s/.💻'
result = emoji_pattern.subn(r'', data)
# result --> ('support.google.co.uk/s/.', 1)
# Doesn't work in this case
data = 'www.google.co.uk/?🤣'
result = emoji_pattern.subn(r'', data)
# result --> ('www.google.co.uk/?🤣', 0)
Can someone help me with this case? It would also be very helpful to know how to find the Unicode code point for 🤣 (or any special character or emoji) in Python 3.9, so that I can add it to the emoji pattern.
Upvotes: 4
Views: 1767
Reputation: 501
Modified emoji pattern, just for reference (it uses the same import as in the question: from re import UNICODE, compile).
emoji_pattern = compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
u"\U0001f926-\U0001f937"
u"\U0001F1F2"
u"\U0001F1F4"
u"\U0001F620"
u"\u200d"
u"\u2640-\u2642"
u"\u2600-\u2B55"
u"\u23cf"
u"\u23e9"
u"\u231a"
u"\ufe0f" # variation selector-16 (emoji presentation)
u"\u3030"
u"\U00002500-\U00002BEF" # box drawing through miscellaneous symbols and arrows
u"\U00010000-\U0010ffff"
"]+", flags=UNICODE)
Thank you
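A narrower alternative, offered as a sketch rather than a definitive fix: the range \U00010000-\U0010ffff matches every supplementary-plane character, and \U000024C2-\U0001F251 also spans CJK ideographs, so both can strip non-emoji text. 🤣 is U+1F923, which sits in the Supplemental Symbols and Pictographs block (U+1F900 to U+1F9FF), so adding just that block is enough for the failing case. The reduced list of ranges below is an assumption for illustration and is not a complete emoji list.

```python
import re

# Reduced, illustrative set of ranges (not exhaustive).
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs (includes U+1F923)
    "\u200d"                 # zero-width joiner
    "\ufe0f"                 # variation selector-16
    "]+"
)

result = emoji_pattern.subn("", "www.google.co.uk/?🤣")
print(result)  # ('www.google.co.uk/?', 1)
```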
Upvotes: 2
Reputation: 212
Check out this answer; the emoji Python package seems like the best way to solve this problem.
To get the Unicode escape sequence (the code point, not the UTF-8 bytes) of any emoji or character, do this:
import emoji
s = '🤣'
print(s.encode('unicode-escape').decode('ASCII'))
It prints \U0001f923, so 🤣 is code point U+1F923, which none of the ranges in the question's pattern cover.
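The code point can also be read with the standard library alone, using ord() and unicodedata. This is a minimal sketch; the helper name describe is hypothetical and chosen only for illustration.

```python
import unicodedata

def describe(text):
    """Return (character, 'U+XXXX', Unicode name) for each code point in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# Print one line per code point, e.g. to see which range a character needs.
for ch, code_point, name in describe("🤣💻"):
    print(ch, code_point, name)
```

Characters built from several code points (for example flags or sequences joined with U+200D) produce one entry per code point, which shows exactly what the pattern has to match.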
Upvotes: 5