Hashirun

Reputation: 89

Split a string of Emojis into single Emoji character

Let's say I have the following string: DATA = "🚀😘👍🏾🇦🇮".

I want to get an array or list with each single emoji as an element, like so [🚀,😘,👍🏾,🇦🇮].

The problem, however, is that the length of an emoji varies: len(u'😘') is 1, whereas len(u'👍🏾') is 2.
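To make the length difference concrete, here is a small sketch for Python 3, where len() counts Unicode code points. The helper name code_points is hypothetical and only there for illustration.

```python
# Python 3 strings are sequences of Unicode code points, so len() counts code points.
kiss = '\U0001f618'                 # 😘: one code point
thumbs = '\U0001f44d\U0001f3fe'     # 👍🏾: base emoji + skin-tone modifier
flag = '\U0001f1e6\U0001f1ee'       # 🇦🇮: two regional indicator symbols

def code_points(text):
    """Return the code points of text as 'U+XXXX' strings (hypothetical helper)."""
    return ['U+%04X' % ord(ch) for ch in text]

print(len(kiss), code_points(kiss))
print(len(thumbs), code_points(thumbs))
print(len(flag), code_points(flag))
```

So a single visible emoji can be made of more than one code point, which is why splitting character by character is not enough.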

So how would I split up my DATA? I've seen it done in JavaScript (How can I split a string containing emoji into an array?), but couldn't figure out a way to do it in Python.

Upvotes: 2

Views: 2765

Answers (2)

Mark Tolonen

Reputation: 178179

Using the 3rd party regex module (pip install regex) and Python 3.5:

>>> import regex
>>> s = '\U0001f680\U0001f618\U0001f44d\U0001f3fe\U0001f1e6\U0001f1ee'
>>> import unicodedata as ud
>>> ud.category(s[0])
'So'
>>> ud.category(s[1])
'So'
>>> ud.category(s[2])
'So'
>>> ud.category(s[3])
'Sk'
>>> ud.category(s[4])
'So'
>>> ud.category(s[5])
'So'
>>> regex.findall(r'\p{So}\p{Sk}*',s)
['\U0001f680', '\U0001f618', '\U0001f44d\U0001f3fe', '\U0001f1e6', '\U0001f1ee']

Edit:

The national flags are pairs of regional indicator symbols from the range U+1F1E6 - U+1F1FF. It turns out regex has a grapheme cluster \X switch, which finds the flags but does not keep the skin tone marker attached to its emoji:

>>> regex.findall(r'\X',s)
['\U0001f680', '\U0001f618', '\U0001f44d', '\U0001f3fe', '\U0001f1e6\U0001f1ee']

However, you could look for symbol modifiers OR grapheme clusters:

>>> regex.findall(r'.\p{Sk}+|\X',s)
['\U0001f680', '\U0001f618', '\U0001f44d\U0001f3fe', '\U0001f1e6\U0001f1ee']

There may be other exceptions.
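One such exception is likely to be emoji joined with ZERO WIDTH JOINER (U+200D) or followed by VARIATION SELECTOR-16 (U+FE0F). As a rough sketch only, the same idea can be written with the standard re module, assuming Python 3.3+ (for \U escapes in patterns). The names _UNIT, _EMOJI and split_emoji are hypothetical, and the pattern is not a full implementation of the Unicode grapheme cluster rules.

```python
import re

# One "unit" is either a flag (two regional indicator symbols) or any single
# character followed by optional variation selectors / skin-tone modifiers.
_UNIT = (r'(?:[\U0001F1E6-\U0001F1FF]{2}'
         r'|.[\uFE0F\U0001F3FB-\U0001F3FF]*)')

# Units glued together by ZERO WIDTH JOINER (U+200D) are kept as one element.
_EMOJI = re.compile(_UNIT + r'(?:\u200D' + _UNIT + r')*', re.DOTALL)

def split_emoji(text):
    """Split text into emoji-sized pieces (illustrative, not exhaustive)."""
    return _EMOJI.findall(text)

s = '\U0001f680\U0001f618\U0001f44d\U0001f3fe\U0001f1e6\U0001f1ee'
print(split_emoji(s))
```

For the string s from the question this gives four elements, with the skin tone attached to the thumbs-up and the two regional indicators kept together as one flag. Other sequences, such as keycaps or tag sequences, are not handled here.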

Upvotes: 3

Waylan

Reputation: 42647

If you want a Python version of the JavaScript solution in How can I split a string containing emoji into an array?, then this should do the trick:

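import re

# Python 3 strings are sequences of code points, not UTF-16 code units, so a
# surrogate-pair pattern never matches. Match any single character outside
# the Basic Multilingual Plane instead (the equivalent of one surrogate pair).
pattern = re.compile(r'([\U00010000-\U0010FFFF])')

def emojiString2List(text):
    # The capturing group makes split() keep each match; drop the empty strings.
    return [x for x in pattern.split(text) if x != '']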

Notice that Python's str.split() method does not accept a regex (while JS's does), therefore you have to use the re library to split using a regex. Also, by using a Python list comprehension, the code is much shorter, but the behavior should be identical. That said, I haven't fully tested the above code. At least it should get you pointed in the right direction.
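To illustrate the difference between str.split() and splitting with the re library, here is a small sketch using a made-up digit pattern (the name digits and the sample strings are only for illustration). It also shows why the empty strings need to be filtered out when matches sit next to each other.

```python
import re

# A capturing group in the pattern makes split() return the delimiters too.
digits = re.compile(r'(\d)')

print('a1b2c'.split('1'))       # str.split takes a literal separator: ['a', 'b2c']
print(digits.split('a1b2c'))    # ['a', '1', 'b', '2', 'c']
print(digits.split('12'))       # adjacent matches leave empty strings: ['', '1', '', '2', '']

# Filtering out the empty strings leaves only the matched pieces.
print([x for x in digits.split('12') if x != ''])   # ['1', '2']
```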

Upvotes: 0
