Reputation: 89
Let's say I have the following string: DATA = "🚀😘👍🏾🇦🇮". I want to get an array or list with each single emoji as an element, like so: [🚀,😘,👍🏾,🇦🇮].
The problem, however, is that the length of an emoji varies: len(u'😘') is 1, whereas len(u'👍🏾') is 2.
So how would I split up my DATA? I've seen it done in JavaScript, but couldn't figure out a way to do it in Python (How can I split a string containing emoji into an array?).
Upvotes: 2
Views: 2765
Reputation: 178179
Using the 3rd party regex module (pip install regex) and Python 3.5:
>>> import regex
>>> s = '\U0001f680\U0001f618\U0001f44d\U0001f3fe\U0001f1e6\U0001f1ee'
>>> import unicodedata as ud
>>> ud.category(s[0])
'So'
>>> ud.category(s[1])
'So'
>>> ud.category(s[2])
'So'
>>> ud.category(s[3])
'Sk'
>>> ud.category(s[4])
'So'
>>> ud.category(s[5])
'So'
>>> regex.findall(r'\p{So}\p{Sk}*',s)
['\U0001f680', '\U0001f618', '\U0001f44d\U0001f3fe', '\U0001f1e6', '\U0001f1ee']
National flags are pairs of regional indicator symbols from the range U+1F1E6 to U+1F1FF. It turns out regex has a grapheme cluster matcher, \X, but in this version it finds the flags and not the skin tone modifier:
>>> regex.findall(r'\X',s)
['\U0001f680', '\U0001f618', '\U0001f44d', '\U0001f3fe', '\U0001f1e6\U0001f1ee']
However, you could look for symbol modifiers OR grapheme clusters:
>>> regex.findall(r'.\p{Sk}+|\X',s)
['\U0001f680', '\U0001f618', '\U0001f44d\U0001f3fe', '\U0001f1e6\U0001f1ee']
There may be other exceptions, such as emoji sequences joined with a zero-width joiner (U+200D).
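If installing regex is not an option, the same two rules (attach skin tone modifiers to the preceding symbol, and pair up regional indicators) can be sketched with the standard library only. This is a rough illustration, not a full grapheme cluster implementation: the function name split_emoji is hypothetical, and it deliberately ignores zero-width-joiner sequences and variation selectors.

```python
def split_emoji(text):
    """Group code points into emoji-like clusters (rough sketch, stdlib only)."""
    clusters = []
    for ch in text:
        cp = ord(ch)
        prev = clusters[-1] if clusters else ""
        # Fitzpatrick skin tone modifiers occupy U+1F3FB..U+1F3FF.
        is_skin_tone = 0x1F3FB <= cp <= 0x1F3FF
        # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
        is_regional = 0x1F1E6 <= cp <= 0x1F1FF
        # A flag is exactly two regional indicators, so only join the second one.
        completes_flag = (
            is_regional
            and len(prev) == 1
            and 0x1F1E6 <= ord(prev) <= 0x1F1FF
        )
        if prev and (is_skin_tone or completes_flag):
            clusters[-1] = prev + ch
        else:
            clusters.append(ch)
    return clusters


DATA = "\U0001f680\U0001f618\U0001f44d\U0001f3fe\U0001f1e6\U0001f1ee"
print(split_emoji(DATA))
```

For the DATA string from the question this yields four elements, with the thumbs-up and its modifier kept together and the two regional indicators joined into one flag.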
Upvotes: 3
Reputation: 42647
If you want a Python version of the JavaScript solution in How can I split a string containing emoji into an array?, then this should do the trick:
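import re
# Python 3 strings are sequences of code points, not UTF-16 units, so match
# code points above U+FFFF instead of surrogate pairs.
pattern = re.compile(r'([\U00010000-\U0010FFFF])')
def emojiString2List(text):
    # Drop the empty strings that split() leaves between adjacent matches.
    return [x for x in pattern.split(text) if x != '']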
Note that Python's str.split() method does not accept a regex (while JavaScript's does), so you have to use the re library to split on a pattern. Using a list comprehension also makes the code much shorter, while the behavior should be identical. That said, I haven't fully tested the above code, but it should at least point you in the right direction.
Upvotes: 0