Reputation: 1055
I'm working on extracting a subset of emojis from text retrieved from an API. What I'd like to do is replace each emoji with its description or name.
I'm using Python 3.4, and my current approach is to look up the character's Unicode name with unicodedata like this:
nname = unicodedata.name(my_unicode)
And I'm substituting with re.sub:
re.sub('[\U0001F602-\U0001F64F]', 'new string', str(orig_string))
I've also tried re.search, then reading the matches and replacing the strings manually, but that did not work and I haven't been able to solve this.
Is there a way of getting a callback for each substitution that re.sub does? Any other route is also appreciated.
Upvotes: 3
Views: 3004
Reputation: 414215
In Python 3.5+, there is the namereplace error handler. You could use it to convert several emoticons at once:
>>> import re
>>> my_text ="\U0001F601, \U0001F602, ♥ and all of this \U0001F605"
>>> re.sub('[\U0001F601-\U0001F64F]+',
... lambda m: m.group().encode('ascii', 'namereplace').decode(), my_text)
'\\N{GRINNING FACE WITH SMILING EYES}, \\N{FACE WITH TEARS OF JOY}, ♥ and all of this \\N{SMILING FACE WITH OPEN MOUTH AND COLD SWEAT}'
Note that more Unicode characters are emoji than the regex pattern covers, e.g., ♥ (U+2665 BLACK HEART SUIT), which is left unchanged above.
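To make that last point concrete, here is a minimal sketch that widens the character class so ♥ is converted too. The class used here (the emoticons block plus U+2665) is only an illustrative assumption, not a complete definition of what counts as an emoji, and `to_named_escapes` is a hypothetical helper name:

```python
import re

# Illustrative character class only: the emoticons block plus one extra
# code point (U+2665). The real set of emoji is much larger than this.
EMOJI_PATTERN = re.compile('[\U0001F600-\U0001F64F\u2665]+')


def to_named_escapes(text):
    """Replace each run of matched characters with \\N{...} escapes."""
    # The 'namereplace' error handler requires Python 3.5 or later.
    return EMOJI_PATTERN.sub(
        lambda m: m.group().encode('ascii', 'namereplace').decode('ascii'),
        text,
    )


print(to_named_escapes("\U0001F601, \u2665"))
```

Any other code points you care about would have to be added to the class by hand, or taken from a dedicated emoji data source.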
Upvotes: 0
Reputation: 5518
You can pass a function as the repl parameter of re.sub().
It is passed the match object and returns the replacement text:
import re
import unicodedata
text = 'I am \U0001F604 and not \U0001F613'  # avoid shadowing the built-in input()
re.sub('[\U0001F602-\U0001F64F]', lambda y: unicodedata.name(y.group(0)), text)
# Outputs:
# 'I am SMILING FACE WITH OPEN MOUTH AND SMILING EYES and not FACE WITH COLD SWEAT'
Upvotes: 2
Reputation: 82899
You can pass a callback function to re.sub. From the documentation:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; [...] If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
So just wrap unicodedata.name in a callback that receives the match object:
>>> my_text ="\U0001F602 and all of this \U0001F605"
>>> re.sub('[\U0001F602-\U0001F64F]', lambda m: unicodedata.name(m.group()), my_text)
'FACE WITH TEARS OF JOY and all of this SMILING FACE WITH OPEN MOUTH AND COLD SWEAT'
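The same idea as a self-contained script may be easier to reuse. This is a sketch rather than a definitive implementation: `describe` and `replace_emoji` are hypothetical names, the range is the one from the question, and the fallback string for unnamed code points is an assumption added here because `unicodedata.name` raises ValueError when a character has no name in the Unicode database bundled with your Python version:

```python
import re
import unicodedata

# Same range as in the question; it does not cover every emoji.
EMOJI_PATTERN = re.compile('[\U0001F602-\U0001F64F]')


def describe(match):
    """Callback for re.sub: return the Unicode name of the matched character."""
    char = match.group()
    # The second argument is returned when the code point has no name,
    # instead of raising ValueError.
    return unicodedata.name(char, 'U+{:04X}'.format(ord(char)))


def replace_emoji(text):
    """Replace every character matched by EMOJI_PATTERN with its name."""
    return EMOJI_PATTERN.sub(describe, text)


print(replace_emoji("\U0001F602 and all of this \U0001F605"))
```

A named function is called once per match, exactly like the lambda, but leaves room for extra logic such as lowercasing the name or wrapping it in delimiters.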
Upvotes: 4
Reputation: 1025
Not so clean, but it works:
import unicodedata
my_text ="\U0001F602 and all of this \U0001F605"
# range() excludes its end value, so add 1 to include U+1F64F itself
for char in range(ord("\U0001F602"), ord("\U0001F64F") + 1):
    my_text = my_text.replace(chr(char), unicodedata.name(chr(char), "NOTHING"))
print(my_text)
Result: FACE WITH TEARS OF JOY and all of this SMILING FACE WITH OPEN MOUTH AND COLD SWEAT
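The loop above calls str.replace once per code point in the range. A single-pass variant, swapping in str.translate with a precomputed table, might look like the sketch below. `build_table` is a hypothetical helper, and unlike the loop it leaves unnamed code points untouched instead of replacing them with "NOTHING":

```python
import unicodedata


def build_table(first, last):
    """Map each named code point in [first, last] to its Unicode name."""
    table = {}
    for code_point in range(ord(first), ord(last) + 1):
        name = unicodedata.name(chr(code_point), None)
        if name is not None:
            # str.translate accepts a dict of ordinal -> replacement string.
            table[code_point] = name
    return table


TABLE = build_table("\U0001F602", "\U0001F64F")

my_text = "\U0001F602 and all of this \U0001F605"
print(my_text.translate(TABLE))
```

The table is built once and can be reused for any number of strings.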
Upvotes: 0