Binit Amin

Reputation: 501

Removing emoji from a string doesn't work in some cases

I am working with data from Google BigQuery that contains emoji. I have code that removes emoji, but it does not work for the specific emoji shown below.

Sample code (Python 3.9) that removes most emoji but fails for the case below:

from re import UNICODE, compile
emoji_pattern = compile("["
                        u"\U0001F600-\U0001F64F"  # emoticons
                        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                        u"\U0001F680-\U0001F6FF"  # transport & map symbols
                        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                        u"\U0001F1F2-\U0001F1F4"  # Macau flag
                        u"\U0001F1E6-\U0001F1FF"  # flags
                        u"\U0001F600-\U0001F64F"
                        u"\U00002702-\U000027B0"
                        u"\U000024C2-\U0001F251"
                        u"\U0001f926-\U0001f937"
                        u"\U0001F1F2"
                        u"\U0001F1F4"
                        u"\U0001F620"
                        u"\u200d"
                        u"\u2640-\u2642"
                        "]+", flags=UNICODE)

# Works for this one 
data = 'support.google.co.uk/s/.💻'
result = emoji_pattern.subn(r'', data)
# result --> ('support.google.co.uk/s/.', 1)

# Doesn't work in this case
data = 'www.google.co.uk/?🤣'
result = emoji_pattern.subn(r'', data)
# result --> ('www.google.co.uk/?🤣', 0)

Can someone help me with this case? It would also be very helpful to know how to check the Unicode code point for 🤣 (or any special character or emoji) in Python 3.9, so that I can add it to the emoji pattern.

Upvotes: 4

Views: 1767

Answers (2)

Binit Amin

Reputation: 501

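Here is the modified emoji pattern, just for reference. It adds wider ranges, including the full supplementary planes (\U00010000-\U0010ffff), which cover 🤣 (U+1F923).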

from re import UNICODE, compile

emoji_pattern = compile("["
                        u"\U0001F600-\U0001F64F"  # emoticons
                        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                        u"\U0001F680-\U0001F6FF"  # transport & map symbols
                        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                        u"\U00002702-\U000027B0"  # dingbats
                        u"\U000024C2-\U0001F251"
                        u"\U0001f926-\U0001f937"
                        u"\U0001F1F2"
                        u"\U0001F1F4"
                        u"\U0001F620"
                        u"\u200d"  # zero-width joiner
                        u"\u2640-\u2642"
                        u"\u2600-\u2B55"
                        u"\u23cf"
                        u"\u23e9"
                        u"\u231a"
                        u"\ufe0f"  # variation selector-16 (emoji presentation)
                        u"\u3030"
                        u"\U00002500-\U00002BEF"  # box drawing through misc symbols and arrows (not Chinese characters)
                        u"\U00010000-\U0010ffff"  # all supplementary planes; this is the range that catches U+1F923
                        "]+", flags=UNICODE)

Thank you
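
For anyone wondering why the pattern from the question missed 🤣: its code point, U+1F923, falls in the gap between the ranges ending at U+1F251 and starting at U+1F926. Below is a rough sketch that checks this. `ORIGINAL_RANGES` and `covered` are names I made up for illustration, and the list leaves out the question's duplicates and the single characters already inside a listed range.

```python
import re

# The question's ranges as (low, high) code points. Hypothetical helper data:
# duplicates and entries already inside a listed range are left out.
ORIGINAL_RANGES = [
    (0x200D, 0x200D),
    (0x24C2, 0x1F251),
    (0x1F300, 0x1F5FF),
    (0x1F600, 0x1F64F),
    (0x1F680, 0x1F6FF),
    (0x1F926, 0x1F937),
]


def covered(ch, ranges=ORIGINAL_RANGES):
    """Return True if the code point of ch lies inside any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


print(covered("\U0001F4BB"))  # laptop emoji from the working example
print(covered("\U0001F923"))  # emoji from the failing example

# The broad range added in the modified pattern matches every character
# outside the Basic Multilingual Plane, so it also strips U+1F923.
broad = re.compile("[\U00010000-\U0010FFFF]+")
print(broad.sub("", "www.google.co.uk/?\U0001F923"))
```

Keep in mind that the broad range removes every supplementary-plane character, not only emoji, so it may be too aggressive if the data can contain other characters from those planes.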

Upvotes: 2

Rama Salahat

Reputation: 212

Check out this answer; the emoji Python package seems like the best way to solve this problem.

To see the Unicode escape (code point) of any emoji or character, do this (no extra package is needed for this part):

s = '🤣'
print(s.encode('unicode-escape').decode('ASCII'))

It prints \U0001f923
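
If you want to scan a whole string instead of a single character, something like the sketch below should work using only the standard library. `describe` is a hypothetical helper name, and skipping everything below code point 128 is just my assumption about what counts as "special" here.

```python
import unicodedata


def describe(text):
    """Return (character, 'U+XXXX', Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127  # assumption: only non-ASCII characters are of interest
    ]


for row in describe("www.google.co.uk/?\U0001F923"):
    print(row)
```

The hex value in the second field is what goes into the regex as a `\UXXXXXXXX` escape, padded to eight hex digits.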

Upvotes: 5
