sheldonzy

Reputation: 5931

Counter for words and emoji

I have a dataframe with a column "clear_message", and I created a column that counts all the words in each row.

history['word_count'] = history.clear_message.apply(lambda x: Counter(x.split(' ')))

For example, if the row's message is 'Hello my name is Hello', then the counter for that row will be Counter({'Hello': 2, 'is': 1, 'my': 1, 'name': 1})

The problem

My text contains emoji, and I also want a count of each emoji.

For example:

test = '👹👹👹👹👹here sasdsa'
test_counter = Counter(test.split(' '))

The output is:

Counter({'sasdsa': 1, '👹👹👹👹👹here': 1})

But I want:

Counter({'sasdsa': 1, '👹': 5, 'here': 1})

Clearly the problem is that I'm using split(' ').

What I thought about:

Adding a space before and after each emoji, like:

test = '👹 👹 👹 👹 👹 here sasdsa'

And then using split, which would work.

  1. I'm not sure this approach is the best.
  2. I'm not sure how to do it. (I do know that if i is an emoji, then i in emoji.UNICODE_EMOJI returns True, using the emoji package.)

Upvotes: 2

Views: 671

Answers (3)

Andj

Reputation: 1364

I thought I'd revisit this question, using a string with more complex emoji graphemes.

If we take a string ' a👹👹👹 ads👨‍👨‍👧 bs🇦🇺' the two older answers will return: ['a', '👹', '👹', '👹', 'ads', '👨', '\u200d', '👨', '\u200d', '👧', 'bs🇦🇺'] and ['a', '👹', '👹', '👹', 'ads👨', '\u200d👨', '\u200d👧', 'bs🇦🇺'].

Both solutions handle simple emoji, but fail with more complex emoji graphemes.
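To see why a per-character check breaks these graphemes apart, it helps to look at the code points behind them. The following is only a small illustration using the standard library; the helper name code_points and the samples dictionary are made up for this sketch.

```python
# Inspect the code points behind each grapheme (standard library only).
samples = {
    "ogre": "\U0001F479",                                    # 👹 single code point
    "family": "\U0001F468\u200D\U0001F468\u200D\U0001F467",  # 👨‍👨‍👧 joined by ZWJ (U+200D)
    "flag": "\U0001F1E6\U0001F1FA",                          # 🇦🇺 pair of regional indicators
}

def code_points(text):
    """Return the code points of text as 'U+XXXX' strings (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

for name, grapheme in samples.items():
    # len() counts code points, not what a reader sees as one emoji
    print(name, len(grapheme), code_points(grapheme))
```

The ogre is a single code point, the family is five (three people joined by two zero-width joiners), and the flag is two, so any approach that tests one character at a time can only get the first case right.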

Below I will use a number of Unicode regular expressions using the regex module, rather than the re module.

I've split the processing into three discrete functions, so it is easier to follow.

split_emoji() will split emoji substrings into individual graphemes. So 👹👹👹 is split into three, 👨‍👨‍👧 remains as one item, as does 🇦🇺.

prep_string() will add extra space around sequences of letters, then will split emoji graphemes, returning a string.

count_elements() will call prep_string, split the resultant string on one or more whitespace characters, then count the elements.

import regex
from collections import Counter

def split_emoji(e):
    if regex.match(r'\p{Emoji}+', e):
        return ' '.join(regex.findall(r'\X', e))
    return e

def prep_string(text):
    text = regex.sub(r'(\p{L}+)', r" \1 ", text)
    items = [split_emoji(substr) for substr in text.split()]
    return " ".join(items)

def count_elements(text):
    return Counter(regex.split(r'\s+', prep_string(text)))

s = ' a👹👹👹 ads👨‍👨‍👧 bs🇦🇺'
count_elements(s)
# Counter({'👹': 3, 'a': 1, 'ads': 1, '👨\u200d👨\u200d👧': 1, 'bs': 1, '🇦🇺': 1})

There are further refinements and edge cases that could be incorporated, like injecting spaces between punctuation and emoji, or stripping punctuation.
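As one possible take on the punctuation refinement, here is a minimal sketch that replaces punctuation with spaces before counting. It assumes "punctuation" means any character whose Unicode general category starts with P; strip_punctuation is a hypothetical helper, not part of the code above.

```python
import unicodedata

def strip_punctuation(text):
    """Replace every Unicode punctuation character (category P*) with a space.

    Assumption: emoji are left alone because they fall in symbol categories
    such as "So", not in the punctuation categories.
    """
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )

cleaned = strip_punctuation("Hello, world!\U0001F479")
print(cleaned.split())  # ['Hello', 'world', '👹']
```

The cleaned string could then be passed on to count_elements(). Note that this also removes apostrophes and hyphens inside words, and the # and * that start keycap emoji, so it may be too aggressive for some text.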

Upvotes: 1

ConorSheehan1

Reputation: 1725

I think your idea of adding a space after each emoji is a good approach. You'll also need to strip whitespace in case there was already a space between an emoji and the next character, but that's simple enough. Something like:

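import emoji

def emoji_splitter(text):
    new_string = ""
    for char in text:
        # UNICODE_EMOJI is the flat lookup table from older releases of the
        # emoji package; newer releases provide emoji.is_emoji() instead
        if char in emoji.UNICODE_EMOJI:
            # pad each emoji with a space on both sides
            new_string += " {} ".format(char)
        else:
            new_string += char
    # split on spaces, then drop the empty strings left by repeated spaces
    return [v for v in map(lambda x: x.strip(), new_string.split(" ")) if v != ""]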

Maybe you could improve this by using a sliding window to check for spaces after emojis and only add spaces where necessary, but that would assume there will only ever be one space, whereas this solution should account for 0 to n spaces between emojis.
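To tie this back to the Counter from the question, here is a rough, self-contained sketch of the same padding idea. It does not use the emoji package: looks_like_emoji is a stand-in predicate that treats any character in Unicode category "So" as an emoji, which is an assumption made only so the snippet runs with the standard library, and split_words_and_emoji is a hypothetical name.

```python
from collections import Counter
import unicodedata

def looks_like_emoji(char):
    # Stand-in (assumption): category "So" (Symbol, other) covers many emoji,
    # but it is cruder than the lookup in the emoji package.
    return unicodedata.category(char) == "So"

def split_words_and_emoji(text):
    # Pad every emoji-like character with spaces, then split on whitespace;
    # split() with no argument already absorbs any number of spaces.
    padded = "".join(f" {ch} " if looks_like_emoji(ch) else ch for ch in text)
    return padded.split()

test = "\U0001F479" * 5 + "here sasdsa"
print(Counter(split_words_and_emoji(test)))
# Counter({'👹': 5, 'here': 1, 'sasdsa': 1})
```

On the dataframe from the question, the same function could replace x.split(' ') inside the lambda passed to apply.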

Upvotes: 2

sheldonzy

Reputation: 5931

There were some problems with @con--'s answer, so I fixed it.

import emoji

def emoji_splitter(text):
    new_string = ""
    text = text.lstrip()
    if text:
        # the first character is always followed by a space
        new_string += text[0] + " "
    # collapse runs of whitespace, then add a space after every emoji
    for char in ' '.join(text[1:].split()):
        new_string += char
        if char in emoji.UNICODE_EMOJI:
            new_string += " "
    return new_string.split()

Example:

emoji_splitter(' a👹👹👹 ads')
Out[7]: ['a', '👹', '👹', '👹', 'ads']

Upvotes: 1
