How to encode emojis that are in text with Python/pandas (for counting them/finding most frequently occurring, etc)?

Question

I am working in Python with pandas and I have a data frame in which one of the columns contains phrases that include emojis, such as "when life gives you 🍋s, make lemonade" or "Catch a falling ⭐️ and put it in your pocket". Not all the phrases have emojis, and when they do, the emoji can be anywhere in the phrase (not just the beginning or end). I want to go through each text and count the frequency of each emoji that appears, find the emojis that appear the most, etc. I am not sure how to actually process/recognize the emojis. If I go through each of the texts in the column, how would I go about identifying the emojis so I can gather the desired information such as counts, max, etc.?

hashcode55 · Accepted Answer

Suppose you have a dataframe like this:

import re  # needed for re.findall below
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'phrases' : ["Smiley emoticon rocks!🍋 I like you.\U0001f601", 
                                "Catch a falling ⭐️ and put it in your pocket"]})

which yields

                 phrases
0   Smiley emoticon rocks!🍋 I like you.😁
1   Catch a falling ⭐️ and put it in your pocket

You can do something like:

# Dictionary storing emoji counts 
emoji_count = defaultdict(int)
for i in df['phrases']:
    for emoji in re.findall(u'[\U0001f300-\U0001f650]|[\u2000-\u3000]', i):
        emoji_count[emoji] += 1

print(emoji_count)

Note that I have changed the ranges in re.findall(u'[\U0001f300-\U0001f650]|[\u2000-\u3000]', i).

The second alternative in the pattern handles a different Unicode range, but you should get the idea.

In Python 2.x you can convert the emoji to unicode using:

unicode('⭐️', 'utf-8') # u'\u2b50\ufe0f' - output
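
In Python 3, str is already Unicode, so no conversion is needed to see the code points. A minimal sketch of how they could be inspected (the variable names star, code_points and escaped are only illustrative):

```python
# The star emoji as it appears in the phrase: U+2B50 followed by
# variation selector-16 (U+FE0F), which requests emoji presentation.
star = '\u2b50\ufe0f'

# List each code point in the usual U+XXXX notation.
code_points = [f'U+{ord(ch):04X}' for ch in star]
print(code_points)

# The same information as an escaped byte string.
escaped = star.encode('unicode_escape')
print(escaped)
```

This also shows why the count below lists only '⭐': the trailing U+FE0F falls outside both ranges in the pattern, so it is not matched.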

Output of print(emoji_count):

defaultdict(int, {'⭐': 1, '🍋': 1, '😁': 1})
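
For the "most frequently occurring" part of the question, a collections.Counter could stand in for the defaultdict, since it provides most_common(). A minimal sketch, assuming the same character ranges as the regex above; EMOJI_PATTERN and count_emojis are hypothetical names, and a plain list (with one extra made-up phrase) stands in for df['phrases']:

```python
import re
from collections import Counter

# Same character ranges as the answer's regex. They are an approximation,
# not a complete definition of "emoji".
EMOJI_PATTERN = re.compile('[\U0001f300-\U0001f650]|[\u2000-\u3000]')


def count_emojis(phrases):
    """Count every character matched by EMOJI_PATTERN across all phrases."""
    counts = Counter()
    for phrase in phrases:
        counts.update(EMOJI_PATTERN.findall(phrase))
    return counts


# A plain list stands in for df['phrases']; iterating over the column
# yields the same strings. The third phrase is invented for illustration.
phrases = [
    "Smiley emoticon rocks!\U0001f34b I like you.\U0001f601",
    "Catch a falling \u2b50\ufe0f and put it in your pocket",
    "More \U0001f34b\U0001f34b please",
]

emoji_counts = count_emojis(phrases)
print(emoji_counts.most_common(1))  # list of (emoji, count), most frequent first
```

Passing df['phrases'] instead of the list should work unchanged as long as every value in the column is a string; missing values would need to be dropped or filled first.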

That regex is shamelessly stolen from this link.
