Reputation: 5931
I have a dataframe with a column "clear_message", and I created a column that counts all the words in each row.
history['word_count'] = history.clear_message.apply(lambda x: Counter(x.split(' ')))
For example, if the row's message is: Hello my name is Hello
then the Counter in that row will be Counter({'Hello': 2, 'is': 1, 'my': 1, 'name': 1})
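The same per-row counting can be reproduced without pandas; a minimal sketch of what the lambda above does for a single message:

```python
from collections import Counter

# Count whitespace-separated words in one message, as the
# DataFrame column does per row.
msg = "Hello my name is Hello"
counts = Counter(msg.split(' '))
# Counter({'Hello': 2, 'my': 1, 'name': 1, 'is': 1})
```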
The problem
I have emoji in my text, and I want also a counter for the emoji.
For example:
test = '🌹🌹🌹🌹🌹here sasdsa'
test_counter = Counter(test.split(' '))
The output is:
Counter({'sasdsa': 1, '🌹🌹🌹🌹🌹here': 1})
But I want:
Counter({'sasdsa': 1, '🌹': 5, 'here': 1})
Clearly the problem is that I'm using split(' ').
What I thought about:
Adding a space before and after each emoji, like:
test = '🌹 🌹 🌹 🌹 🌹 here sasdsa'
and then using split, which will work. (To check whether a character i is an emoji: i in emoji.UNICODE_EMOJI returns True, from the emoji package.)
Upvotes: 2
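The padding idea can be sketched end to end. A minimal version, substituting a rough codepoint-range check for emoji.UNICODE_EMOJI so it needs no third-party package (the ranges below are an assumption covering common emoji blocks only, not the full emoji set):

```python
from collections import Counter

def is_emoji(ch):
    # Rough check: common emoji blocks only. This is an assumption
    # for the sketch; emoji.UNICODE_EMOJI is far more complete.
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF

def pad_emoji(text):
    # Surround every emoji with spaces; split() later collapses
    # any resulting runs of whitespace.
    return ''.join(f' {c} ' if is_emoji(c) else c for c in text)

test = '🌹🌹🌹🌹🌹here sasdsa'
Counter(pad_emoji(test).split())
# Counter({'🌹': 5, 'here': 1, 'sasdsa': 1})
```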
Views: 671
Reputation: 1364
I thought I'd revisit this question, using a string with more complex emoji graphemes.
If we take a string ' a🌹🌹🌹 ads👨‍👨‍👧 bs🇦🇺'
the two older answers will return: ['a', '🌹', '🌹', '🌹', 'ads', '👨', '\u200d', '👨', '\u200d', '👧', 'bs🇦🇺']
and ['a', '🌹', '🌹', '🌹', 'ads👨', '\u200d👨', '\u200d👧', 'bs🇦🇺'].
Both solutions handle simple emoji, but fail with more complex emoji graphemes.
Below I will use a number of Unicode regular expressions using the regex module, rather than the re module.
I've split the processing into three discrete functions, so it is easier to follow.
split_emoji() will split emoji substrings into individual graphemes. So 🌹🌹🌹 is split into three, 👨‍👨‍👧 remains as one item, as does 🇦🇺.
prep_string() will add extra space around sequences of letters, then will split emoji graphemes, returning a string.
count_elements() will call prep_string, split the resultant string on one or more whitespace characters, then count the elements.
import regex
from collections import Counter

def split_emoji(e):
    if regex.match(r'\p{Emoji}+', e):
        return ' '.join(regex.findall(r'\X', e))
    return e

def prep_string(text):
    text = regex.sub(r'(\p{L}+)', r" \1 ", text)
    items = [split_emoji(substr) for substr in text.split()]
    return " ".join(items)

def count_elements(text):
    return Counter(regex.split(r'\s+', prep_string(text)))

s = ' a🌹🌹🌹 ads👨‍👨‍👧 bs🇦🇺'
count_elements(s)
# Counter({'🌹': 3, 'a': 1, 'ads': 1, '👨\u200d👨\u200d👧': 1, 'bs': 1, '🇦🇺': 1})
There are further refinements and edge cases that could be incorporated, like injecting spaces between punctuation and emoji, or stripping punctuation.
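For reference, the grapheme clustering that \X performs can be approximated in the standard library for the two cases in this string. A sketch assuming only ZWJ sequences and regional-indicator flag pairs need joining (real grapheme segmentation has many more rules, which is why the regex module is preferable):

```python
ZWJ = '\u200d'

def _is_regional_indicator(ch):
    # Regional indicator symbols U+1F1E6..U+1F1FF; two of them
    # form a flag emoji like 🇦🇺.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def cluster(chars):
    # Greedily glue ZWJ sequences and pairs of regional indicators;
    # everything else stays a single character.
    out = []
    for c in chars:
        if out and (out[-1].endswith(ZWJ) or c == ZWJ):
            out[-1] += c
        elif (out and _is_regional_indicator(c)
              and len(out[-1]) == 1 and _is_regional_indicator(out[-1])):
            out[-1] += c
        else:
            out.append(c)
    return out

cluster(list('👨\u200d👨\u200d👧🇦🇺'))
# ['👨\u200d👨\u200d👧', '🇦🇺']
```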
Upvotes: 1
Reputation: 1725
I think your idea of adding a space after each emoji is a good approach. You'll also need to strip white space in case there already was a space between an emoji and the next character, but that's simple enough. Something like:
import emoji

def emoji_splitter(text):
    new_string = ""
    for char in text:
        if char in emoji.UNICODE_EMOJI:
            new_string += " {} ".format(char)
        else:
            new_string += char
    return [v for v in map(lambda x: x.strip(), new_string.split(" ")) if v != ""]
Maybe you could improve this by using a sliding window to check for spaces after emoji and only add spaces where necessary, but that would assume there will only ever be one space, whereas this solution should account for 0 to n spaces between emoji.
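A runnable variant of the above, substituting a rough codepoint-range check for emoji.UNICODE_EMOJI so it doesn't require the emoji package (the range is an assumption covering common emoji only):

```python
def looks_like_emoji(ch):
    # Assumption: one common emoji block, for illustration only;
    # emoji.UNICODE_EMOJI covers the full emoji set.
    return 0x1F300 <= ord(ch) <= 0x1FAFF

def emoji_splitter(text):
    new_string = ""
    for char in text:
        # Pad each emoji with spaces, then split and drop empties.
        new_string += " {} ".format(char) if looks_like_emoji(char) else char
    return [v for v in map(lambda x: x.strip(), new_string.split(" ")) if v != ""]

emoji_splitter('🌹🌹🌹here sasdsa')
# ['🌹', '🌹', '🌹', 'here', 'sasdsa']
```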
Upvotes: 2
Reputation: 5931
There were some problems with @con--'s answer, so I fixed it:
import emoji

def emoji_splitter(text):
    new_string = ""
    text = text.lstrip()
    if text:
        new_string += text[0] + " "
    for char in ' '.join(text[1:].split()):
        new_string += char
        if char in emoji.UNICODE_EMOJI:
            new_string = new_string + " "
    return list(map(lambda x: x.strip(), new_string.split()))
example:
emoji_splitter(' a🌹🌹🌹 ads')
Out[7]: ['a', '🌹', '🌹', '🌹', 'ads']
Upvotes: 1