Reputation: 6422
I am trying to separate the emoji in a given text from other characters/words/emojis. I want to use the emoji later as features in text classification, so it is important that I treat each emoji in the sentence individually, as a separate character.
The code:
import re
text = "I am very #happy man butππ my wifeπ is not ππ"
print(text) #line a
reg = re.compile(u'['
u'\U0001F300-\U0001F64F'
u'\U0001F680-\U0001F6FF'
u'\u2600-\u26FF\u2700-\u27BF]+',
re.UNICODE)
#padding the emoji with space at both the ends
new_text = reg.sub(' \1 ',text)
print(new_text) #line b
# this is just to test if it can still identify the emoji in new_text
new_text2 = reg.sub('#\1#', new_text)
print(new_text2) # line c
Here is the actual output:
(I had to paste the screenshot because copy pasting the output here from terminal was distorting those already distorted emojis in line b and c)
Here is my expected output:
I am very #happy man butππ my wifeπ is not ππ
I am very #happy man but π π my wife π is not π π
I am very #happy man but #π# #π# my wife #π# is not #π# #π#
Questions:
1) Why is the search and replace not working as expected? What is the emoji being replaced with? (line b). It is definitely not the unicode for the original emoji, otherwise line c would have printed the emoji with # padded at both ends.
2) I am not sure I am right about this, but why are the grouped emoji being replaced with a single emoji/unicode? (line b)
Upvotes: 2
Views: 879
Reputation: 626689
There are several issues here.
1) There is no capturing group in the pattern, but the replacement uses a \1 backreference to Group 1. The most natural workaround is to use a backreference to Group 0, i.e. the whole match, which is written \g<0>.
2) \1 in the replacement is not actually parsed as a backreference, but as a char with an octal value 1, because the backslash in regular (not raw) string literals forms escape sequences. Here, it is an octal escape.
3) + after the ] means that the regex engine must match 1 or more occurrences of text matching the character class, so you match sequences of emojis rather than each separate emoji.
Use
import re
text = "I am very #happy man butππ my wifeπ is not ππ"
print(text) #line a
reg = re.compile(u'['
u'\U0001F300-\U0001F64F'
u'\U0001F680-\U0001F6FF'
u'\u2600-\u26FF\u2700-\u27BF]',
re.UNICODE)
#padding the emoji with space at both ends
new_text = reg.sub(r' \g<0> ',text)
print(new_text) #line b
# this is just to test if it can still identify the emojis in new_text
new_text2 = reg.sub(r'#\g<0>#', new_text)
print(new_text2) # line c
See the Python demo, which prints:
I am very #happy man butππ my wifeπ is not ππ
I am very #happy man but π π my wife π is not π π
I am very #happy man but #π# #π# my wife #π# is not #π# #π#
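To see each of the three issues in isolation, here is a minimal sketch. The sample string and the variable names are hypothetical, and the emoji are written as \U escapes (U+1F600 and U+1F622, both inside the first range of the character class) so that the snippet does not depend on how the page or terminal renders them.

```python
import re

# Same character class as above, written with escapes only.
EMOJI_CLASS = (
    '['
    '\U0001F300-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF'
    ']'
)
single = re.compile(EMOJI_CLASS)         # one emoji per match
grouped = re.compile(EMOJI_CLASS + '+')  # a whole run of emojis per match

# Hypothetical sample: two U+1F600 in a row, then one U+1F622.
sample = 'but\U0001F600\U0001F600 wife\U0001F622'

# In a regular (non-raw) string literal, '\1' is an octal escape for U+0001.
octal_char = '\1'

# With '+' and the non-raw replacement, each run of emojis collapses
# into a single control character padded with spaces.
broken = grouped.sub(' \1 ', sample)

# A raw r'\1' does reach the regex engine as a backreference,
# but the pattern has no Group 1, so the substitution raises re.error.
try:
    single.sub(r'\1', sample)
    group_error = None
except re.error as exc:
    group_error = exc

# The fix: no '+', and a raw-string reference to the whole match.
fixed = single.sub(r' \g<0> ', sample)

print(repr(octal_char))
print(repr(broken))
print(repr(fixed))
print(fixed.split())
```

Since the goal is to use each emoji as a separate feature, calling split() on the padded text, as in the last line, is one simple way to get each emoji as its own token; whether that fits depends on the tokenizer used for classification.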
Upvotes: 4