Reputation: 3431
I recently needed to extract all emojis from a string to count how often specific emojis occur. The emoji Python package let me extract all emojis, but modifiers such as skin tones were always extracted as separate emojis. I wanted to ignore skin tones (Fitzpatrick modifiers) and Variation Selectors (see this page for types, and Wikipedia for background on the Fitzpatrick scale). The following code returns Fitzpatrick modifiers as separate emojis (which is not what I need):
import emoji
def extract_emojis(str):
    return list(c for c in str if c in emoji.UNICODE_EMOJI)
Example: the emoji ❤️ is actually composed of two code points: a heart (U+2764) and a variation selector (U+FE0F) that requests emoji presentation, which is why the heart is drawn in red. print(ascii('❤️')) results in '\u2764\ufe0f': two separate code points but only one emoji. The second code point makes no sense on its own, yet it is returned as a separate emoji in the list from return list(c for c in str if c in emoji.UNICODE_EMOJI).
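To see which code points make up such an emoji, a small sketch using only the standard library's unicodedata module can list each code point with its Unicode name. The helper name describe_code_points is made up for illustration and is not part of the emoji package.

```python
import unicodedata

def describe_code_points(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]

heart = "\u2764\ufe0f"            # heart followed by a variation selector
thumbs = "\U0001F44D\U0001F3FD"   # thumbs up followed by a skin tone modifier

for code_point, name in describe_code_points(heart + thumbs):
    print(code_point, name)
```

The skin tone shows up with a name starting with EMOJI MODIFIER, while the second half of the heart is named VARIATION SELECTOR-16, so the two kinds of trailing code points can be told apart by name.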
Upvotes: 3
Views: 1230
Reputation: 3431
Here is a solution that ignores skin tones and other modifiers and treats all these emoji variations as one emoji. The answer from Martijn Pieters here helped me write the following solution to my problem:
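import emoji
import unicodedata
def checkEmojiType(strEmo):
    # False for skin tone modifiers; the "" default avoids a ValueError
    # for code points that have no Unicode name
    return not unicodedata.name(strEmo, "").startswith("EMOJI MODIFIER")
def extract_emojis(str):
    # keep every emoji character that is not a modifier
    return list(c for c in str if c in emoji.UNICODE_EMOJI and checkEmojiType(c))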
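Since the original goal was counting, here is a minimal sketch of how the filtered characters could be fed into collections.Counter. It assumes a small hypothetical set WANTED as a stand-in for the emoji package's lookup table, so that it runs with the standard library alone. The names strip_modifiers and count_emojis are also made up for this example.

```python
import unicodedata
from collections import Counter

def strip_modifiers(text):
    """Drop skin tone modifiers and variation selectors from text."""
    return "".join(
        ch for ch in text
        if not unicodedata.name(ch, "").startswith(("EMOJI MODIFIER", "VARIATION SELECTOR"))
    )

def count_emojis(text, wanted):
    """Count occurrences of the characters in `wanted`, ignoring modifiers."""
    return Counter(ch for ch in strip_modifiers(text) if ch in wanted)

# Hypothetical stand-in for the emoji lookup table: thumbs up and heart.
WANTED = {"\U0001F44D", "\u2764"}

# Three thumbs up (two with different skin tones) and one heart.
sample = "\U0001F44D\U0001F3FD \U0001F44D\U0001F3FB \u2764\ufe0f \U0001F44D"
counts = count_emojis(sample, WANTED)
print(counts)
```

Stripping the modifiers first matters when the lookup table itself contains the skin tone code points, as a full emoji table typically does.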
[edit] However, Zero-Width Joiner (ZWJ) sequences (see comment below) do not seem to be supported by the solution above at the moment. You can test it yourself with the following code:
# The medical emoji is a ZWJ sequence (http://www.unicode.org/emoji/charts/emoji-zwj-sequences.html):
# man + zero-width joiner + medical symbol + variation selector.
# It should only fall back to a double emoji if the joined glyph is not available.
n = '\U0001F468\u200D\u2695\uFE0F'
# extract all emojis with the function from above
nlist = extract_emojis(n)
for xstr in nlist:
    # print the code point of each extracted emoji
    print('Emoji Extract: U+%04x' % ord(xstr))
for _c in n:
    # print all Unicode code points directly
    print('Each Codepoint: U+%04x' % ord(_c))
This is the output:
Emoji Extract: U+1f468
Emoji Extract: U+2695
Each Codepoint: U+1f468
Each Codepoint: U+200d
Each Codepoint: U+2695
Each Codepoint: U+fe0f
The extraction did not join the two emojis into a single one, although a ZWJ sequence should be treated as one emoji.
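One possible way around this, sketched here under my own assumptions rather than as a feature of the emoji package, is to group code points before filtering: a code point is attached to the previous group if it is a zero-width joiner, a variation selector, or a skin tone modifier, or if the previous group ends with a zero-width joiner. The names is_extender and group_sequences are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner

def is_extender(ch):
    """True for code points that continue the preceding emoji."""
    name = unicodedata.name(ch, "")
    return ch == ZWJ or name.startswith(("EMOJI MODIFIER", "VARIATION SELECTOR"))

def group_sequences(text):
    """Split text into groups, keeping ZWJ sequences and modifiers together."""
    groups = []
    for ch in text:
        if groups and (is_extender(ch) or groups[-1].endswith(ZWJ)):
            groups[-1] += ch      # continue the current sequence
        else:
            groups.append(ch)     # start a new group
    return groups

# man + ZWJ + medical symbol + variation selector, followed by a heart
n = "\U0001F468\u200D\u2695\uFE0F" + "\u2764\uFE0F"
for group in group_sequences(n):
    print(" ".join("U+%04x" % ord(c) for c in group))
```

This sketch does not check whether a group is an emoji at all; ordinary characters simply become groups of length one, so the first code point of each group would still need to be tested against the emoji table.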
Upvotes: 3