user7179686

Stripping all emoji from a string

Working environment Python version:

Python 3.6.1

I've tried a number of methods outlined here on Stack Overflow and elsewhere, but I still can't get this working.

The input could be any string. The emoji may or may not be surrounded by whitespace, and they may appear inside quotes, after a hashtag, and so on. These cases are giving me trouble.

This is what I have:

import re
import sys
sys.maxunicode

emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  
                           u"\U0001F300-\U0001F5FF"
                           u"\U0001F680-\U0001F6FF"
                           u"\U0001F1E0-\U0001F1FF"
                           "]+", flags=re.UNICODE)

text = ""  # This could be any text, with or without emoji
text = emoji_pattern.sub(r'', text)

However, when displayed or printed, the resulting text still contains the emoji.

text is a Unicode string, i.e., type(text) returns <type 'unicode'>

So what am I missing? Some emoji still remain. I would also prefer a method that keeps working as new emoji are added to Unicode, so I would rather have something that simply keeps all regular characters.

Encoding the text as 'unicode_escape' gives the following:

b'[1/2] Can you see yourself as Prompto or Aranea?\\nGet higher quality images from our FB page \\n\\u2b07\\ufe0f\\u2026'

The raw unformatted text is:

[1/2] Can you see yourself as Prompto or Aranea?
Get higher quality images from our FB page
⬇️…

Upvotes: 2

Views: 1113

Answers (1)

Mark Tolonen

Reputation: 178179

Not sure what you think sys.maxunicode does, but your code works with Python 3.6. Are you sure you have all the emoji ranges covered?

import re

emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  
                           u"\U0001F300-\U0001F5FF"
                           u"\U0001F680-\U0001F6FF"
                           u"\U0001F1E0-\U0001F1FF"
                           "]+", flags=re.UNICODE)

text = 'Actual text with emoji: ->\U0001F620\U0001F310\U0001F690\U0001F1F0<-'
print(text)
text = emoji_pattern.sub(r'', text)
print(text)

Output:

Actual text with emoji: ->😠🌐🚐🇰<-
Actual text with emoji: -><-
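
To check the coverage question against the sample posted in the question, here is a minimal sketch that tests each non-ASCII character of that sample against the same four ranges. The two extra ranges in extended_pattern are my own assumption about what would need adding for this particular sample. They are not a complete emoji list.

```python
import re

# The four ranges from the code above.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "]+"
)

# Tail of the sample text, copied from the unicode_escape dump in the question.
sample = "FB page \n\u2b07\ufe0f\u2026"

# Collect the non-ASCII characters that the pattern does not match.
missed = [f"U+{ord(ch):04X}" for ch in sample
          if ord(ch) > 127 and not emoji_pattern.match(ch)]
print(missed)  # ['U+2B07', 'U+FE0F', 'U+2026']

# Assumption: extend the class with the two blocks that hold the arrow and
# its variation selector. U+2026 is an ordinary ellipsis and is kept.
extended_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "\u2B00-\u2BFF"  # Miscellaneous Symbols and Arrows (contains U+2B07)
    "\uFE00-\uFE0F"  # Variation Selectors (contains U+FE0F)
    "]+"
)

cleaned = extended_pattern.sub("", sample)
print(cleaned.encode("unicode_escape"))  # b'FB page \\n\\u2026'
```

So the arrow in the sample survives because U+2B07 and U+FE0F sit outside all four original ranges, not because the substitution fails.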

Note that flags=re.UNICODE is the default in Python 3.6, so it is not needed. Unicode strings are also the default, so u"xxxx" can just be "xxxx".
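
If you would rather not maintain a list of ranges, one possible alternative is to filter on Unicode general category instead. This is only a sketch under a loud assumption: it treats "emoji" as roughly "category So (Symbol, other) plus the joiner and variation-selector characters". The helper name strip_symbols and the _JOINERS set are hypothetical names made up for this example.

```python
import unicodedata

# Assumption: characters that glue emoji sequences together and should go too.
_JOINERS = {"\u200d", "\ufe0e", "\ufe0f"}  # zero-width joiner, VS15, VS16


def strip_symbols(text):
    """Drop characters in category 'So' plus the joiners listed above.

    Hypothetical helper: an approximation of emoji removal, not an
    exact implementation of the Unicode emoji definition.
    """
    return "".join(
        ch for ch in text
        if ch not in _JOINERS and unicodedata.category(ch) != "So"
    )


sample = "page \n\u2b07\ufe0f\u2026 ->\U0001F620\U0001F310\U0001F690\U0001F1F0<-"
print(strip_symbols(sample).encode("unicode_escape"))
```

The trade-off is that category So is broader than emoji: this also removes symbols such as © and °, and the result depends on the Unicode version bundled with your Python build. If those symbols matter to you, an explicit range list like the one above gives finer control.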

Upvotes: 1
