sheldonzy

Reputation: 5931

Split and count emojis and words in a given string in Python

For a given string, I'm trying to count the number of appearances of each word and emoji. I already did it here for emojis that consist of a single emoji. The problem is that many current emojis are composed of several emojis.

For example, the emoji 👨‍👩‍👦‍👦 consists of four emojis - 👨 👩 👦 👦, and emojis with a human skin color are composed too: 🙅🏽 is 🙅 🏽, etc.
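To make the composition concrete, here is a small sketch that lists the code points inside the two emojis above, using only the standard library. The names `family`, `no_good`, and `code_points` are illustrative, and the emojis are written as escapes so the snippet does not depend on the file's encoding:

```python
import unicodedata

# Family emoji: four person emojis glued together by ZERO WIDTH JOINER (U+200D).
family = "\U0001F468\u200d\U0001F469\u200d\U0001F466\u200d\U0001F466"
# Gesture emoji followed by a skin tone modifier.
no_good = "\U0001F645\U0001F3FD"

def code_points(s):
    """Return a (U+XXXX, Unicode name) pair for every code point in s."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in s]

for label, value in (("family", family), ("no_good", no_good)):
    print(label, "has", len(value), "code points")
    for cp, name in code_points(value):
        print("   ", cp, name)
```

Iterating over a Python `str` yields code points, not what the user sees as one emoji, which is why a plain `Counter(text)` or `text.split()` breaks these apart.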

The problem boils down to how to split the string in the right order, and then counting them is easy.

There are some good questions that address the same thing, like link1 and link2, but none of them gives a general solution (or the solution is outdated, or I just can't figure it out).

For example, if the string is hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦, then I want {'hello': 2, 'emoji': 1, '👨‍👩‍👦‍👦': 1, '👩🏾‍🎓': 1}. My strings are from WhatsApp, and all are encoded in UTF-8.

I had many bad attempts. Help would be appreciated.

Upvotes: 3

Views: 2558

Answers (3)

William Egesdal

Reputation: 1

emoji.UNICODE_EMOJI is a dictionary with structure

{'en':
    {'🥇': ':1st_place_medal:',
     '🥈': ':2nd_place_medal:',
     '🥉': ':3rd_place_medal:',
     ... }
}

so you will need to use emoji.UNICODE_EMOJI['en'] for the above code to work.
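If the same code has to run against both layouts, a tiny helper can pick the right mapping. This is only a sketch: `english_emoji_table` is a hypothetical name, and the two sample dictionaries stand in for the package's table rather than coming from it.

```python
def english_emoji_table(table):
    """Return the emoji -> name mapping from either layout.

    Assumption: a flat table is keyed by emoji strings, so it never
    contains the key 'en'; a nested table is keyed by language code.
    """
    return table["en"] if "en" in table else table

# Stand-in data for illustration only (not read from the emoji package).
flat = {"\U0001F947": ":1st_place_medal:"}
nested = {"en": flat}

print(english_emoji_table(flat) == english_emoji_table(nested))
```

With the emoji package this would be called as `english_emoji_table(emoji.UNICODE_EMOJI)`, assuming the installed version still exposes that name.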

Upvotes: 0

sheldonzy

Reputation: 5931

Huge thanks to Mark Tolonen. To count both words and emojis in a given string, I use emoji.UNICODE_EMOJI (from the emoji package) to decide what is an emoji and what is not, and then remove the emojis from the string.

This is not an ideal answer yet, but it works, and I'll edit it if that changes.

import emoji
import regex
from collections import Counter

def split_count(text):
    total_emoji = []  # every emoji grapheme cluster, in order of appearance
    # \X matches one extended grapheme cluster, so composed emojis stay whole
    data = regex.findall(r'\X', text)
    for word in data:
        # On emoji versions where UNICODE_EMOJI is keyed by language,
        # test against emoji.UNICODE_EMOJI['en'] instead
        if any(char in emoji.UNICODE_EMOJI for char in word):
            total_emoji.append(word)

    # Remove the emojis from the given text so only the words remain
    for current in total_emoji:
        text = text.replace(current, '')

    return Counter(text.split() + total_emoji)


text_string = "👹👹👹👹👹here hello world hello👨‍👩‍👦‍👦🙅🏽"
final_counter = split_count(text_string)

Output:

final_counter
Counter({'hello': 2,
         'here': 1,
         'world': 1,
         '👨\u200d👩\u200d👦\u200d👦': 1,
         '👹': 5,
         '🙅🏽': 1})

Upvotes: 3

Mark Tolonen

Reputation: 177406

Use the 3rd party regex module, which supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character):

>>> import regex
>>> s='👨‍👩‍👦‍👦🙅🏽'
>>> regex.findall(r'\X',s)
['👨\u200d👩\u200d👦\u200d👦', '🙅🏽']
>>> for c in regex.findall(r'\X',s):
...     print(c)
... 
👨‍👩‍👦‍👦
🙅🏽

To count them:

>>> data = regex.findall(r'\X',s)
>>> from collections import Counter
>>> Counter(data)
Counter({'👨\u200d👩\u200d👦\u200d👦': 1, '🙅🏽': 1})
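For intuition about what `\X` is doing with these two emojis, here is a rough standard-library approximation. It is not the full Unicode grapheme cluster algorithm and is not a replacement for `regex`: it only glues zero width joiner sequences, skin tone modifiers, and the emoji variation selector onto the preceding code point, and it ignores flags, keycaps, and combining marks. The names `rough_clusters` and `is_skin_tone` are made up for this sketch.

```python
from collections import Counter

ZWJ = "\u200d"   # ZERO WIDTH JOINER, glues emojis into one sequence
VS16 = "\ufe0f"  # VARIATION SELECTOR-16, requests emoji presentation

def is_skin_tone(ch):
    """True for the five Fitzpatrick skin tone modifiers (U+1F3FB..U+1F3FF)."""
    return 0x1F3FB <= ord(ch) <= 0x1F3FF

def rough_clusters(text):
    """Group code points into approximate user-visible characters."""
    clusters = []
    for ch in text:
        extends_previous = clusters and (
            ch in (ZWJ, VS16)              # joiner or selector attaches backwards
            or is_skin_tone(ch)            # modifier attaches to the emoji before it
            or clusters[-1].endswith(ZWJ)  # whatever follows a joiner is joined
        )
        if extends_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Same string as above, written with escapes: family emoji + gesture with skin tone.
s = "\U0001F468\u200d\U0001F469\u200d\U0001F466\u200d\U0001F466\U0001F645\U0001F3FD"
print(len(s), "code points ->", len(rough_clusters(s)), "clusters")
print(Counter(rough_clusters(s)))
```

The string has nine code points but only two clusters, which matches the `regex.findall(r'\X', s)` result for this particular input.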

Upvotes: 2
