dimitris93

Reputation: 4273

Replace all emojis from a given unicode string

I have a list of unicode symbols from the emoji package. My end goal is to create a function that takes a unicode string as input, e.g. some👩😌thing, and removes all emojis, giving "something". Below is a demonstration of what I want to achieve:

from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ...  # desired result: 'something'

While trying to do the above, I came across some strange behavior, demonstrated below. I believe that if the code below is fixed, I will be able to achieve my end goal.

import regex as re
print u'\U0001F469'                     # 👩   
print u'\U0001F60C'                     # 😌    
print u'\U0001F469\U0001F60C'           # 👩😌 

text = u'some\U0001F469\U0001F60Cthing' 
print text                              # some👩😌thing

# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text)  # something
# Removing only "👩" doesn't work 
print re.sub(ur'[\U0001f469]+', u'', text)            # some�thing

Upvotes: 8

Views: 2432

Answers (3)

Mark Ransom

Reputation: 308530

In many builds of Python 2.7 (the so-called narrow builds), Unicode codepoints above 0xFFFF are encoded as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469'), which returns 2 on such a build.
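For illustration, the surrogate pair behind a codepoint can be worked out by hand. The following is a minimal sketch written for Python 3, where len() of the emoji is 1, so the UTF-16 code units are derived explicitly. The helper name surrogate_pair is made up for this example.

```python
def surrogate_pair(codepoint):
    """Return the (high, low) UTF-16 surrogates for a codepoint above U+FFFF."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint is not in a supplementary plane")
    offset = codepoint - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

for ch in ('\U0001F469', '\U0001F60C'):
    high, low = surrogate_pair(ord(ch))
    print('U+%04X -> \\u%04X \\u%04X' % (ord(ch), high, low))
# U+1F469 -> \uD83D \uDC69
# U+1F60C -> \uD83D \uDE0C
```

Note that both emoji from the question share the same high surrogate, \uD83D.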

The best way to solve this is to move to a version of Python that properly treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 for this (a wide build, configured with --enable-unicode=ucs4), and Python 3.3 and later do it automatically.
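As a rough check, here is how the character class from the question would behave on Python 3.3 or later, where each emoji is a single character. This is only a sketch of the expected behavior, not a drop-in replacement for the Python 2 code above.

```python
import re

text = 'some\U0001F469\U0001F60Cthing'

# Each emoji is one character here, so a character class is safe.
only_woman_removed = re.sub('[\U0001F469]+', '', text)
both_removed = re.sub('[\U0001F469\U0001F60C]+', '', text)

print(len('\U0001F469'))                             # 1
print(only_woman_removed == 'some\U0001F60Cthing')   # True
print(both_removed)                                  # something
```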

To create a regular expression to use for the replace, simply join all the characters together with |. Since the list of characters already is encoded with surrogate pairs it will create the proper string.

subs = u'|'.join(map(re.escape, exclude_list))  # escape entries that contain regex metacharacters
print re.sub(subs, u'', text)

Upvotes: 3

Wiktor Stribiżew

Reputation: 627507

To remove all emojis from the input string using the current approach, use

# -*- coding: utf-8 -*-
import re
from emoji import UNICODE_EMOJI
text = u'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
# Escape every emoji, then join them into one non-capturing alternation
rx = ur"(?:{})+".format(u"|".join(map(re.escape, exclude_list)))
print re.sub(rx, u'', text)
# => u'something'

If you do not re.escape the emoji strings, you will get a "nothing to repeat" error: some entries contain characters that are regex metacharacters (for example *, which begins the keycap asterisk emoji), and these break the alternation inside the group. That is why map(re.escape, exclude_list) is required.
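The effect of skipping re.escape can be reproduced with a small list. The sketch below is written for Python 3 and uses a hypothetical three-entry exclude_list as a stand-in for UNICODE_EMOJI.keys(). Sorting the entries longest first is an extra precaution that is not part of the code above.

```python
import re

# Hypothetical stand-in for UNICODE_EMOJI.keys(); the second entry starts with '*'.
exclude_list = ['\U0001F469', '*\u20e3', '\U0001F60C']

try:
    re.compile('(?:{})+'.format('|'.join(exclude_list)))
    unescaped_ok = True
except re.error:
    unescaped_ok = False   # the bare '*' has nothing to repeat

# Escaping makes every character literal; trying longer entries first keeps
# a short emoji from matching only the start of a longer sequence.
escaped = map(re.escape, sorted(exclude_list, key=len, reverse=True))
rx = re.compile('(?:{})+'.format('|'.join(escaped)))

print(unescaped_ok)                                        # False
print(rx.sub('', 'some\U0001F469*\u20e3\U0001F60Cthing'))  # something
```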

Tested in Python 2.7.12 (default, Nov 12 2018, 14:36:49) [GCC 5.4.0 20160609] on linux2.

Upvotes: 2

Jongware

Reputation: 22478

The old 2.7 regex engine gets confused because:

  1. Narrow builds of Python 2.7 store Unicode strings as 16-bit units, so every codepoint above U+FFFF is stored as a surrogate pair.

  2. Before the regex engine "sees" your Python string, Python has already split each large Unicode codepoint into two separate characters (each one a valid code unit, but an incomplete character on its own).

  3. That means that [\U0001f469]+ is a character class of 2 characters, \ud83d and \udc69, and it removes either one wherever it occurs. The high surrogate \ud83d is also the first half of \U0001F60C, so it is stripped from that emoji too, leaving the lone low surrogate \ude0c. That leads to your badly formed output.

This fixes it:

print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text)  # something
# Removing only "👩" didn't work with a character class, but with a group it does:
print re.sub(ur'(\U0001f469)+', u'', text)            # some😌thing

because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.
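The difference between the character class and the group can be simulated on Python 3 by rebuilding the string with one character per UTF-16 code unit, which is roughly how a narrow build stores it. This is a sketch under that assumption; the helper to_utf16_units is invented for the demonstration.

```python
import re
import struct

def to_utf16_units(s):
    """Rebuild s with one character per UTF-16 code unit (narrow-build view)."""
    data = s.encode('utf-16-be')
    units = struct.unpack('>%dH' % (len(data) // 2), data)
    return ''.join(chr(u) for u in units)

text = to_utf16_units('some\U0001F469\U0001F60Cthing')
woman = to_utf16_units('\U0001F469')    # '\ud83d\udc69'

# Character class: either surrogate matches on its own.
with_class = re.sub('[' + woman + ']+', '', text)
# Group: the two surrogates must appear together, in order.
with_group = re.sub('(?:' + woman + ')+', '', text)

print(ascii(with_class))   # 'some\ude0cthing'
print(ascii(with_group))   # 'some\ud83d\ude0cthing'
```

The class leaves a lone \ude0c behind, while the group leaves the complete pair for 😌 intact.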

If you want to remove every emoji in exclude_list from the text, you can explicitly loop over its contents and replace them one by one:

exclude_list = UNICODE_EMOJI.keys()

for bad in exclude_list:  # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print u'Removing ' + bad
        text = text.replace(bad, u'')
print text

# Output:
# Removing 👩
# Removing 😌
# something

(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
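Wrapped up as the function the question asks for, the loop might look like the sketch below (Python 3 syntax). The name remove_emojis and the two-entry sample list are hypothetical; in practice the list would come from the emoji package. Processing longer entries first is an added precaution so that multi-codepoint sequences are removed before their parts.

```python
def remove_emojis(text, emojis):
    """Return text with every string in emojis removed via plain str.replace."""
    # Longest first, so multi-codepoint sequences go before their parts.
    for bad in sorted(emojis, key=len, reverse=True):
        text = text.replace(bad, '')
    return text

# Hypothetical stand-in for UNICODE_EMOJI.keys()
sample_emojis = ['\U0001F469', '\U0001F60C']
print(remove_emojis('some\U0001F469\U0001F60Cthing', sample_emojis))  # something
```

Because str.replace matches exact substrings rather than individual characters, this approach does not depend on how the interpreter stores surrogate pairs.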

Upvotes: 2
