Reputation: 3327
I am working in Python with pandas and I have a data frame in which one of the columns contains phrases that include emojis, such as "when life gives you 🍋s, make lemonade" or "Catch a falling ⭐️ and put it in your pocket". Not all the phrases have emojis, and when they do, the emoji can be anywhere in the phrase (not just at the beginning or end). I want to go through each text and count the frequency of each emoji that appears, find the emojis that appear the most, etc. I am not sure how to actually process/recognize the emojis. If I go through each of the texts in the column, how would I identify the emojis so I can gather the desired information, such as counts, max, etc.?
Upvotes: 3
Views: 5223
Reputation: 5860
Suppose you have a dataframe like this
import re
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'phrases': ["Smiley emoticon rocks!🍋 I like you.\U0001f601",
                               "Catch a falling ⭐️ and put it in your pocket"]})
which yields
phrases
0 Smiley emoticon rocks!🍋 I like you.😁
1 Catch a falling ⭐️ and put it in your pocket
You can do something like :
# Dictionary storing emoji counts
emoji_count = defaultdict(int)
for i in df['phrases']:
    for emoji in re.findall(u'[\U0001f300-\U0001f650]|[\u2000-\u3000]', i):
        emoji_count[emoji] += 1
print(emoji_count)
Note that I have changed the range in re.findall(u'[\U0001f300-\U0001f650]|[\u2000-\u3000]', i).
The second alternative, [\u2000-\u3000], is there to handle a different Unicode range (it is the part that matches ⭐, U+2B50), but you should get the idea.
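As a rough illustration of what each half of the pattern picks up, the two ranges can be tried separately. The names pictographs, symbols and text below are made up for this sketch. Note that the variation selector U+FE0F that follows ⭐ falls in neither range, so it is dropped, and that the second range is broad enough to also match ordinary typographic punctuation such as a curly apostrophe (U+2019), which you may want to filter out.

```python
import re

# Each alternative of the combined pattern, compiled on its own.
pictographs = re.compile('[\U0001f300-\U0001f650]')
symbols = re.compile('[\u2000-\u3000]')

# "\u2b50\ufe0f" is the star followed by its variation selector.
text = "Catch a falling \u2b50\ufe0f and a 🍋"

print(pictographs.findall(text))  # only the lemon (U+1F34B)
print(symbols.findall(text))      # only the star (U+2B50), without U+FE0F

# Caveat: the second range also covers general punctuation.
print(symbols.findall("it\u2019s"))  # the curly apostrophe is matched too
```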
In Python 2.x you can decode the emoji from UTF-8 bytes to a unicode string using
unicode('⭐️', 'utf-8')  # output: u'\u2b50\ufe0f'
Output :
defaultdict(int, {'⭐': 1, '🍋': 1, '😁': 1})
That regex is shamelessly stolen from this link.
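For the "which emoji appears the most" part of the question, one possible variation is to use collections.Counter instead of a defaultdict, since it has a most_common method. This is only a sketch under the same regex assumption as above; count_emojis, EMOJI_RE and the sample phrases are hypothetical names and data for illustration. The function accepts any iterable of strings, so a column such as df['phrases'] could be passed in directly.

```python
import re
from collections import Counter

# Same ranges as in the answer; this is an approximation, not a full emoji list.
EMOJI_RE = re.compile('[\U0001f300-\U0001f650]|[\u2000-\u3000]')


def count_emojis(phrases):
    """Count every character matched by EMOJI_RE across all phrases."""
    return Counter(
        ch for phrase in phrases for ch in EMOJI_RE.findall(phrase)
    )


# Hypothetical sample data; with pandas you would pass df['phrases'] instead.
phrases = [
    "when life gives you 🍋s, make lemonade",
    "Catch a falling \u2b50\ufe0f and put it in your pocket",
    "🍋 and more 🍋",
    "no emoji here",
]

emoji_count = count_emojis(phrases)
print(emoji_count)                 # per-emoji frequencies
print(emoji_count.most_common(1))  # the single most frequent emoji
```

Phrases without any emoji simply contribute nothing to the counter, so no special handling is needed for them.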
Upvotes: 3