jms

Reputation: 225

Extract and count emoji in python pandas

I have a CSV file (it also exists as JSON, if that's easier) with this structure, e.g.:

id,created_at,text,author_id
1432691762007400458,2021-08-31T13:08:28.000Z,"9月1日㈬は…🎺🎺🎶\n\n♦️18:50〜 山岸一生\n『練馬から変える!国会を創る!キックオフ集会』\nhttpstest\n\n♦️20:30~ 辻元清美\n#りっけんチャンネル\n「コロナ禍・五輪から見えた""おっさん政治""の実態」について\nhttpsxzt\n\n【テレビ】\n♦️19:30~ 玄葉光一郎\nBS-TBS「報道1930」",951781409470889984
1432687902148816898,2021-08-31T12:53:08.000Z,やはり別の地平であったか...\n\nコロナ禍五輪\n貴重な物資をなぜ捨てる?,1227501971742937088

I found this:

import emoji
import regex

def split_count(text):

    emoji_list = []
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)
    
    return emoji_list

However, I would first need to import the CSV and limit the query to the "text" column, right? So I was wondering whether I could set it up something like this:

import csv
import re
import pandas as pd
import emoji
import regex

df = pd.read_csv('/Users/hidden/testfile.csv')
df = df[['id','created_at','text','author_id']]

def split_count(text):
    emoji_list = []
    data = regex.findall(r'\X', 'text')
    for word in data:
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)
    
    return emoji_list

However, this way it won't even load the 'emoji' module. Do I need to download an extra package? (I already installed the emoji package via pip.) Also, is there a way to get a new list of emoji plus emoji counts? E.g. like this:

🎺 2 🎶 1

I am sorry for the very insufficient MWE, please bear with me.

Upvotes: 1

Views: 1235

Answers (1)

zana saedpanah

Reputation: 334

First, you should work on the text column only. Second, there is a mistake in your code:

data = regex.findall(r'\X', 'text')

You put quotation marks around text, so the function searches the literal string 'text' instead of the variable that was passed in. If you also want to count the emoji, use Counter from the collections module, like this:

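import emoji
import pandas as pd
import regex

def split_count(text):
    """Return every emoji found in text, one list entry per occurrence."""
    emoji_list = []
    # \X matches one extended grapheme cluster, so multi-code-point emoji stay together.
    data = regex.findall(r'\X', text)
    for word in data:
        # emoji.UNICODE_EMOJI exists in emoji 1.x; it was removed in emoji 2.0,
        # where emoji.EMOJI_DATA can be used for the same membership test.
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)

    return emoji_list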

df =pd.DataFrame([{'id':1432691762007400458,'created_at':"2021-08-31T13:08:28.000Z",'text':"9月1日㈬は…🎺🎺🎶\n\n♦️18:50〜 山岸一生\n『練馬から変える!国会を創る!キックオフ集会』\nhttpstest\n\n♦️20:30~ 辻元清美\n#りっけんチャンネル\n「コロナ禍・五輪から見えた""おっさん政治""の実態」について\nhttpsxzt\n\n【テレビ】\n♦️19:30~ 玄葉光一郎\nBS-TBS「報道1930」",'author_id':951781409470889984},
{'id':1432687902148816898,'created_at':"2021-08-31T12:53:08.000Z",'text':"やはり別の地平であった🎶か...\n\nコロナ禍五輪\n貴重な物資をなぜ捨てる?",'author_id':1227501971742937088}])

from collections import Counter

# Collect the emoji from every row of the text column.
emoji_list = []
for t in df['text']:
    emoji_list.extend(split_count(t))

# Count how often each emoji occurs.
print(Counter(emoji_list))


------------------
output:
Counter({'♦️': 3, '🎶': 2, '🎺': 2})
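
To get the "emoji count" layout asked for in the question, the Counter can be turned into plain text. This is only a minimal sketch: the short emoji_list below is a stand-in for whatever split_count returns over your rows, and the variable names are illustrative.

```python
from collections import Counter

# Stand-in for the list collected by split_count() over all rows.
emoji_list = ['🎺', '🎺', '🎶']

counts = Counter(emoji_list)

# most_common() yields (item, count) pairs, highest count first.
lines = [f"{symbol} {count}" for symbol, count in counts.most_common()]
print(" ".join(lines))
```

counts.most_common(10) would limit the listing to the ten most frequent emoji.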

Note that some texts may contain emoji as HTML character references, and those will not be detected. I recommend calling `text = html.unescape(text)` at the beginning of the split_count function to convert them to Unicode. In Python 3, html is part of the standard library, so `import html` is enough and nothing extra needs to be installed.
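
As a small illustration of what html.unescape does, assuming the emoji arrives as a numeric character reference (the sample strings are made up):

```python
import html

# An HTML-encoded trumpet emoji: decimal character reference for U+1F3BA.
encoded = "concert tonight &#127930;"
decoded = html.unescape(encoded)
print(decoded)

# The hexadecimal form of the same reference decodes to the same character.
same = html.unescape("&#x1F3BA;")
```

After this step the text contains the real Unicode character, so the emoji lookup can find it.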

There are also other libraries for preprocessing text that you can use for this purpose, such as the ekphrasis library, which lets you add a custom emoji dictionary.

You can also write it from scratch, for example by extending the function with a list of text emoticons:

emoticons = [':-)', ':)', '(:', '(-:', ':))', '((:', ':-D', ':D', 'X-D', 'XD', 'xD', 'xD', '<3', '</3', ':*',
                 ';-)',
                 ';)', ';-D', ';D', '(;', '(-;', ':-(', ':(', '(:', '(-:', ':,(', ':\'(', ':"(', ':((', ':D', '=D',
                 '=)',
                 '(=', '=(', ')=', '=-O', 'O-=', ':o', 'o:', 'O:', 'O:', ':-o', 'o-:', ':P', ':p', ':S', ':s', ':@',
                 ':>',
                 ':<', '^_^', '^.^', '>.>', 'T_T', 'T-T', '-.-', '*.*', '~.~', ':*', ':-*', 'xP', 'XP', 'XP', 'Xp',
                 ':-|',
                 ':->', ':-<', '$_$', '8-)', ':-P', ':-p', '=P', '=p', ':*)', '*-*', 'B-)', 'O.o', 'X-(', ')-X']

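import html

def split_count(text):
    # Convert HTML character references (e.g. &#127930;) to Unicode first.
    text = html.unescape(text)
    emoji_list = []
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)

    # split() with no argument splits on any whitespace, including newlines.
    for word in text.split():
        if word in emoticons:
            emoji_list.append(word)

    return emoji_list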
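
The emoticon lookup can also be tried on its own. The following is just a sketch: EMOTICONS is a small illustrative subset of the list above, and find_emoticons is a hypothetical helper name, not part of any library.

```python
# Illustrative subset of the emoticon list; a set gives fast membership tests.
EMOTICONS = {':)', ':(', ':D', '<3', ';)'}

def find_emoticons(text):
    """Return the whitespace-separated tokens of text that are known emoticons."""
    return [token for token in text.split() if token in EMOTICONS]

found = find_emoticons("great show :)\nsee you soon <3")
print(found)
```

Because the text is split on whitespace, an emoticon glued to a word (such as "nice:)") is not matched by this approach.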

Upvotes: 1
