Ravindra S

Reputation: 6422

Python emoji search and replace not working as expected

I am trying to separate each emoji in a given text from the other characters, words, and emojis. I want to use the emojis later as features in text classification, so it is important that I treat each emoji in a sentence individually, as a separate character.

The code:

import re

text = "I am very #happy man but😘😘 my wife😞 is not 😊😘"
print(text) #line a

reg = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]+', 
    re.UNICODE)

#padding the emoji with space at both the ends
new_text = reg.sub(' \1 ',text) 
print(new_text) #line b

# this is just to test if it can still identify the emoji in new_text
new_text2 = reg.sub('#\1#', new_text) 
print(new_text2) # line c

Here is the actual output:

[Screenshot of the terminal output]

(I had to paste the screenshot because copy pasting the output here from terminal was distorting those already distorted emojis in line b and c)

Here is my expected output:

I am very #happy man but😘😘 my wife😞 is not 😊😘
I am very #happy man but 😘  😘  my wife 😞  is not  😊  😘 
I am very #happy man but #😘#  #😘#  my wife #😞#  is not  #😊#  #😘# 

Questions:

1) Why is the search and replace not working as expected? What is the emoji being replaced with (line b)? It is definitely not the Unicode for the original emoji, otherwise line c would have printed the emoji padded with # at both ends.

2) I am not sure I am right about this, but why is a group of adjacent emojis replaced with a single emoji/character (line b)?

Upvotes: 2

Views: 879

Answers (1)

Wiktor Stribiżew

Reputation: 626689

There are several issues here.

  • There are no capturing groups in the regex pattern, yet the replacement pattern uses a \1 backreference to Group 1. The most natural fix is to reference Group 0, i.e. the whole match, with \g<0>.
  • The \1 in the replacement is not actually parsed as a backreference but as a single character with octal value 1, because a backslash in a regular (non-raw) string literal starts an escape sequence. Here it is an octal escape, so use a raw string literal (r'...') for the replacement.
  • The + after the ] tells the regex engine to match one or more consecutive characters from the character class, so you match whole runs of emojis rather than each emoji separately. Remove it to match one emoji at a time.
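
The three points above can be checked in isolation. The following is a minimal sketch, assuming Python 3; the names sample, runs, single, broken, raised and fixed are made up for illustration, and only the first range from the question is used to keep it short.

```python
import re

# Python 3 assumed. In a regular string literal, '\1' is an octal escape:
# a single character with code point 1, not a backslash followed by '1'.
assert '\1' == '\x01'
assert r'\1' == '\\1'          # raw string: backslash kept, two characters

sample = 'but\U0001F618\U0001F618 my'            # two adjacent emoji

runs = re.compile('[\U0001F300-\U0001F64F]+')    # with +: matches whole runs
single = re.compile('[\U0001F300-\U0001F64F]')   # without +: one emoji per match

# Non-raw '\1': the whole run is replaced by the control character U+0001.
broken = runs.sub(' \1 ', sample)

# Raw r'\1' is a real backreference, but the pattern has no Group 1,
# so the re module rejects the replacement template.
try:
    single.sub(r' \1 ', sample)
    raised = False
except re.error:
    raised = True

# \g<0> refers to the whole match, so no capturing group is needed.
fixed = single.sub(r' \g<0> ', sample)
print(repr(broken), raised, fixed)
```

Printing repr(broken) makes the invisible U+0001 character visible, which is most likely what showed up as a distorted symbol in the terminal.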

Use

import re

text = "I am very #happy man but😘😘 my wife😞 is not 😊😘"
print(text) #line a

reg = re.compile(u'['
    u'\U0001F300-\U0001F64F'
    u'\U0001F680-\U0001F6FF'
    u'\u2600-\u26FF\u2700-\u27BF]', 
    re.UNICODE)

#padding the emoji with space at both ends
new_text = reg.sub(r' \g<0> ',text) 
print(new_text) #line b

# this is just to test if it can still identify the emojis in new_text
new_text2 = reg.sub(r'#\g<0>#', new_text) 
print(new_text2) # line c

See the Python demo, which prints:

I am very #happy man but😘😘 my wife😞 is not 😊😘
I am very #happy man but 😘  😘  my wife 😞  is not  😊  😘 
I am very #happy man but #😘#  #😘#  my wife #😞#  is not  #😊#  #😘# 
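
Since the goal in the question is to use each emoji as a separate feature, the padded text can then be split on whitespace. Here is a rough sketch of that idea, assuming Python 3; pad_emoji and tokenize are hypothetical helper names, and the pattern adds a capturing group as an alternative to \g<0> so that a raw \1 becomes valid. The ranges are the ones from the question and do not cover every emoji.

```python
import re

# Same character ranges as in the question, wrapped in parentheses so that
# the single matched emoji is available as Group 1.
EMOJI = re.compile(
    '(['
    '\U0001F300-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF'
    '])'
)

def pad_emoji(text):
    """Surround every single emoji with spaces (hypothetical helper)."""
    # Raw string, so \1 reaches the re module as a backreference to Group 1.
    return EMOJI.sub(r' \1 ', text)

def tokenize(text):
    """Pad the emoji, then split on whitespace so each emoji is its own token."""
    return pad_emoji(text).split()

tokens = tokenize('but\U0001F618\U0001F618 my wife\U0001F61E')
print(tokens)
```

Calling split() with no arguments also collapses the doubled spaces that the padding introduces between adjacent emoji.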

Upvotes: 4
