Reputation: 33
I have a dataframe which contains a lot of different emojis and I want to remove them. I looked at answers to similar questions but they didn't work for me.
index| messages
----------------
1 |Hello! 👋
2 |Good Morning 😃
3 |How are you ?
4 | Good 👍
5 | Ländern
Now I want to remove all these emojis from the DataFrame so that it looks like this:
index| messages
----------------
1 |Hello!
2 |Good Morning
3 |How are you ?
4 | Good
5 |Ländern
I tried the solution here, but unfortunately it also removes all non-English letters like "ä". How can I remove emojis from a DataFrame without losing those letters?
Upvotes: 1
Views: 1906
Reputation: 142038
You can use the emoji package:
import emoji
df = ...
df['messages'] = df['messages'].apply(lambda s: emoji.replace_emoji(s, ''))
Upvotes: 1
Reputation: 15309
This solution keeps all ASCII and Latin-1 characters, i.e. characters between U+0000 and U+00FF in this list (for extended Latin plus Greek, use < 1024 instead of < 256):
import pandas as pd
df = pd.DataFrame({'messages': ['Länder 🇩🇪❤️', 'Hello! 👋']})
# Keep only characters whose code point is below 256 (ASCII and Latin-1)
filter_char = lambda c: ord(c) < 256
df['messages'] = df['messages'].apply(lambda s: ''.join(filter(filter_char, s)))
Result:
messages
0 Länder
1 Hello!
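To make the difference between the two thresholds concrete, here is a minimal sketch in plain Python (no pandas needed). The helper name keep_below and the Greek sample string are my own additions for illustration, not part of the question:

```python
def keep_below(text, limit):
    """Keep only characters whose Unicode code point is below `limit`."""
    return ''.join(c for c in text if ord(c) < limit)

samples = ['Länder 🇩🇪❤️', 'Ελληνικά 👋']

# limit=256 keeps ASCII and Latin-1 only, so the Greek letters are dropped too
latin1_only = [keep_below(s, 256) for s in samples]

# limit=1024 additionally keeps extended Latin and basic Greek
up_to_greek = [keep_below(s, 1024) for s in samples]

print(latin1_only)
print(up_to_greek)
```

In both cases the space that preceded the emoji is left behind, so you may want to call .strip() on the result.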
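Note that this does not work for Japanese text, for example, because those characters lie far above U+00FF and are removed as well. Another problem is that the heart "emoji" is actually a Dingbat, which sits inside the Basic Multilingual Plane, so I can't simply keep everything in that plane instead.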
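If you need to keep Japanese (or other non-Latin) text, one possible workaround is to remove specific emoji ranges instead of keeping a range of letters. The following is only a rough sketch: the ranges in EMOJI_PATTERN are my own approximation and certainly not a complete list of emoji, and strip_emoji is a hypothetical helper name.

```python
import re

# Approximate emoji ranges (an assumption, not an exhaustive list)
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F000-\U0001FAFF"  # pictographs, emoticons, flags (regional indicators)
    "\u2600-\u27BF"          # Miscellaneous Symbols and Dingbats (includes the heart, U+2764)
    "\uFE0F"                 # variation selector-16, which follows some emoji
    "\u200D"                 # zero-width joiner used in emoji sequences;
                             # some scripts also use it, so drop this line if that matters
    "]+"
)

def strip_emoji(text):
    """Remove characters in the ranges above and trim leftover whitespace."""
    return EMOJI_PATTERN.sub("", text).strip()

messages = ['Länder 🇩🇪❤️', 'Hello! 👋', '日本語 😃']
cleaned = [strip_emoji(m) for m in messages]
print(cleaned)  # ['Länder', 'Hello!', '日本語']
```

On a DataFrame column the same function could be applied with df['messages'].apply(strip_emoji).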
Upvotes: 3
Reputation: 4921
I think the following answers your question. I added some other characters for verification.
import pandas as pd
df = pd.DataFrame({'messages':['Hello! 👋', 'Good-Morning 😃', 'How are you ?', ' Goodé 👍', 'Ländern' ]})
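# Characters that cannot be encoded as Latin-1 (such as emojis) are silently dropped
df['messages'] = df['messages'].astype(str).apply(lambda x: x.encode('latin-1', 'ignore').decode('latin-1'))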
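To see what the encode/decode round trip does to individual strings, here is a small sketch in plain Python. The Japanese sample is my own addition to show the limitation: anything outside Latin-1 is dropped, not only emojis.

```python
samples = ['Hello! 👋', ' Goodé 👍', 'Ländern', '日本語']

# errors='ignore' drops every character that has no Latin-1 encoding
cleaned = [s.encode('latin-1', 'ignore').decode('latin-1') for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

The accented letters survive, the emojis disappear (leaving their neighbouring spaces), and the Japanese string becomes empty.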
Upvotes: 2