Reputation: 333
I wrote a script that extracts all emojis from a given dataset:
for message in df['Message']:
    for char in message:
        if char in emoji.UNICODE_EMOJI:
            print(char)
It mostly works and correctly identifies which characters are emojis. However, some emojis are not printed correctly and simply show up as a brown square:
🏽
Why is this happening? Is there any way of solving this? Most emojis show up just fine but there are a few that just won't.
Edit: After looking into it again, it seems the brown squares accompany certain emojis to indicate the skin tone used.
However, there are still issues with some emojis. The usual heart emoji, for example, does show up as a heart character, but not in emoji style. Screenshot because pasting it here ends up displaying it correctly:
Upvotes: 1
Views: 1726
Reputation: 10995
The issue is that skin tone variants (and some other emoji variants) are encoded as two separate code points instead of one, i.e.
👍🏿
results from the two symbols 👍 and 🏿 (the second is a modifier that sets the skin tone).
You can see it from this example:
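import emoji
import pandas as pd

df = pd.DataFrame({"Message": ["test 👍🏿 "]})
for message in df['Message']:
    for char in message:
        # Note: the flat emoji.UNICODE_EMOJI dict comes from older releases
        # of the emoji package; newer releases changed or removed it.
        if char in emoji.UNICODE_EMOJI:
            print(char)
output:
👍
🏿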
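To see why the loop prints two characters, it can help to look at the code points directly. The following is a minimal sketch using only the standard library; the helper name `describe` is made up for illustration and is not part of the `emoji` package.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

# Thumbs up followed by the dark skin tone modifier, written as escapes
thumbs_up_dark = "\U0001F44D\U0001F3FF"

for code_point, name in describe(thumbs_up_dark):
    print(code_point, name)
```

The string has length 2 in Python, so iterating over it character by character yields the base emoji and the skin tone modifier separately.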
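So you will have to match whole grapheme clusters with the third-party regex module, whose \X pattern keeps a base emoji and its modifier together (as per this answer):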
import emoji
import pandas as pd
import regex

df = pd.DataFrame({"Message": ["test 👍🏿 ", "test 2 👍 👍"]})

def split_count(text):
    """Return the emoji grapheme clusters found in text."""
    emoji_list = []
    # \X matches one extended grapheme cluster (base character plus modifiers)
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_list.append(word)
    return emoji_list

for message in df['Message']:
    counter = split_count(message)
    print(' '.join(counter))
output:
👍🏿
👍 👍
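The plain-looking heart from the question's edit is probably the same effect: the emoji-style heart is usually U+2764 followed by the invisible variation selector U+FE0F, and a character-by-character loop prints only the first code point. Below is a minimal standard-library sketch, assuming the message contains that two-code-point sequence. The helper name `join_modifiers` is hypothetical, and it only handles skin tone modifiers and U+FE0F, not ZWJ sequences or flags, so it is an approximation of what \X does rather than a replacement for it.

```python
# Skin tone modifiers occupy U+1F3FB through U+1F3FF
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)}
# Variation selector 16 requests emoji-style presentation
VARIATION_SELECTOR_16 = "\uFE0F"

def join_modifiers(text):
    """Split text into characters, gluing modifiers onto the preceding character."""
    groups = []
    for ch in text:
        if groups and (ch in SKIN_TONE_MODIFIERS or ch == VARIATION_SELECTOR_16):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

heart = "\u2764\uFE0F"  # heavy black heart + variation selector 16
print(len(heart))
print(join_modifiers("a" + heart + "\U0001F44D\U0001F3FF"))
```

Since \X matches extended grapheme clusters, the regex-based `split_count` above should likewise keep the heart and its variation selector together.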
Upvotes: 3