Sam

Reputation: 33

Python pandas: Remove emojis from DataFrame

I have a dataframe which contains a lot of different emojis and I want to remove them. I looked at answers to similar questions but they didn't work for me.

    index| messages
    ----------------
    1    |Hello! 👋
    2    |Good Morning 😃
    3    |How are you ?
    4    | Good 👍
    5    | Ländern

Now I want to remove all these emojis from the DataFrame so that it looks like this:

    index| messages
    ----------------
    1    |Hello!
    2    |Good Morning   
    3    |How are you ?
    4    | Good 
    5    |Ländern

I tried the solution here, but unfortunately it also removes all non-English letters such as "ä". How can I remove only the emojis from a DataFrame?

Upvotes: 1

Views: 1906

Answers (3)

Guru Stron

Reputation: 142038

You can use the emoji package:

import emoji

df = ...
df['messages'] = df['messages'].apply(lambda s: emoji.replace_emoji(s, ''))

Upvotes: 1

xjcl

Reputation: 15309

This solution keeps all ASCII and Latin-1 characters, i.e. characters between U+0000 and U+00FF, and drops everything else. For extended Latin plus Greek, use < 1024 instead:

import pandas as pd

df = pd.DataFrame({'messages': ['Länder 🇩🇪❤️', 'Hello! 👋']})

# Keep only characters with a code point below 256 (ASCII and Latin-1)
filter_char = lambda c: ord(c) < 256
df['messages'] = df['messages'].apply(lambda s: ''.join(filter(filter_char, s)))

Result:

  messages
0  Länder 
1  Hello!
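
The cutoff mentioned above (256 for Latin-1, 1024 for extended Latin plus Greek) can be made a parameter. This is only a minimal sketch; `keep_below` is a hypothetical helper name, not something provided by pandas:

```python
def keep_below(text: str, limit: int = 256) -> str:
    """Keep only the characters whose code point is below `limit`."""
    return ''.join(c for c in text if ord(c) < limit)


# Default limit of 256 keeps ASCII and Latin-1, so "ä" survives
latin = keep_below('L\u00e4nder \U0001F1E9\U0001F1EA')

# A limit of 1024 additionally keeps Greek letters
greek = keep_below('\u03b1\u03b2\u03b3 \U0001F44B', limit=1024)
```

With a DataFrame it could be applied as df['messages'].apply(keep_below, limit=1024), since Series.apply forwards extra keyword arguments to the function.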

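Note that this does not work for text in other scripts, such as Japanese, because those characters are removed as well. Another problem is that the heart "emoji" is actually a Dingbat (U+2764), which sits inside the Basic Multilingual Plane, so I can't simply keep the whole Basic Multilingual Plane and drop everything above it.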
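
To keep Japanese and other scripts, one option is to go the other way and delete only the code point ranges that typically hold emoji. A rough sketch, assuming the hand-picked ranges below are good enough for your data (they are not a complete emoji list, and `remove_emoji` is a made-up name):

```python
import re

# Assumed, non-exhaustive set of code point ranges that commonly hold emoji
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flag pairs)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and related symbol blocks
    "\u2600-\u27BF"          # miscellaneous symbols and Dingbats (includes the heart, U+2764)
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero width joiner; caution: some scripts also use it
    "]+"
)


def remove_emoji(text: str) -> str:
    """Delete every character that falls into one of the ranges above."""
    return EMOJI_PATTERN.sub("", text)
```

On the DataFrame this would be df['messages'].apply(remove_emoji). Be aware that the U+2600 to U+27BF range also removes non-emoji symbols such as check marks.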

Upvotes: 3

Ruthger Righart

Reputation: 4921

I think the following answers your question. I added some other characters for verification.

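import pandas as pd
df = pd.DataFrame({'messages':['Hello! 👋', 'Good-Morning 😃', 'How are you ?', ' Goodé 👍', 'Ländern' ]})

# Encoding to Latin-1 with errors='ignore' drops every character above U+00FF, including the emojis
df['messages'] = df['messages'].astype(str).apply(lambda x: x.encode('latin-1', 'ignore').decode('latin-1'))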

Upvotes: 2
