Jose Torres

Reputation: 1055

Substitute Emoji with its description or name

I'm working on extracting a subset of emojis from text retrieved from an API. What I'd like to do is replace each emoji with its description or name.

I'm working with Python 3.4, and my current approach is to look up each character's Unicode name with unicodedata, like this:

nname = unicodedata.name(my_unicode)

And I'm substituting with re.sub:

re.sub('[\U0001F602-\U0001F64F]', 'new string', str(orig_string))

I've also tried re.search, then accessing the matches and replacing the strings (plain string replacement doesn't work with a regex), but I haven't been able to solve this.

Is there a way of getting a callback for each substitution that re.sub does? Any other route is also appreciated.

Upvotes: 3

Views: 3004

Answers (4)

jfs

Reputation: 414215

In Python 3.5+, there is the namereplace error handler. You can use it to convert several emoticons at once:

>>> import re
>>> my_text ="\U0001F601, \U0001F602, ♥ and all of this \U0001F605"
>>> re.sub('[\U0001F601-\U0001F64F]+',
...        lambda m: m.group().encode('ascii', 'namereplace').decode(), my_text)
'\\N{GRINNING FACE WITH SMILING EYES}, \\N{FACE WITH TEARS OF JOY}, ♥ and all of this \\N{SMILING FACE WITH OPEN MOUTH AND COLD SWEAT}'

Note that more Unicode characters are emoji than the regex pattern covers, e.g. ♥ (U+2665 BLACK HEART SUIT), which is left untouched above.
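To illustrate the point about coverage, here is a minimal sketch that widens the character class to two Unicode blocks. The choice of blocks (Emoticons and Miscellaneous Symbols) and the name EMOJI_PATTERN are assumptions made for illustration; the set is deliberately incomplete and would need extending for real data.

```python
import re
import unicodedata

# Assumed, deliberately incomplete set of ranges; extend as needed.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # Emoticons block
    "\u2600-\u26FF"          # Miscellaneous Symbols block (includes U+2665)
    "]"
)

my_text = "\U0001F601, \U0001F602, \u2665 and all of this \U0001F605"

# Replace every matched character with its Unicode name.
result = EMOJI_PATTERN.sub(lambda m: unicodedata.name(m.group()), my_text)
print(result)
```

With this pattern the heart is replaced as well, since U+2665 falls inside the second range.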

Upvotes: 0

lemonhead

Reputation: 5518

You can pass a function as the repl parameter of re.sub().

The function is passed the match object and returns the replacement string:

import re
import unicodedata

# Avoid naming the variable "input", which would shadow the built-in.
text = 'I am \U0001F604 and not \U0001F613'
re.sub('[\U0001F602-\U0001F64F]', lambda y: unicodedata.name(y.group(0)), text)
# Outputs:
# 'I am SMILING FACE WITH OPEN MOUTH AND SMILING EYES and not FACE WITH COLD SWEAT'

Upvotes: 2

tobias_k

Reputation: 82899

You can pass a callback function to re.sub. From the documentation:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; [...] If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.

So just call unicodedata.name inside the callback:

>>> import re, unicodedata
>>> my_text ="\U0001F602  and all of this \U0001F605"
>>> re.sub('[\U0001F602-\U0001F64F]', lambda m: unicodedata.name(m.group()), my_text)
'FACE WITH TEARS OF JOY  and all of this SMILING FACE WITH OPEN MOUTH AND COLD SWEAT'
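The same idea also works with a named function, which leaves room to handle characters that have no Unicode name. The following is a rough sketch only: the function name emoji_to_name, the square brackets around each name, and the "U+XXXX" fallback are illustrative assumptions, not part of the answer above.

```python
import re
import unicodedata

def emoji_to_name(match):
    """Return the Unicode name of the matched character, in brackets."""
    char = match.group()
    # unicodedata.name raises ValueError for characters without a name
    # unless a default is supplied; here the hex code point is the fallback.
    fallback = "U+{:04X}".format(ord(char))
    return "[" + unicodedata.name(char, fallback) + "]"

my_text = "\U0001F602  and all of this \U0001F605"

# re.sub calls emoji_to_name once per non-overlapping match.
result = re.sub("[\U0001F602-\U0001F64F]", emoji_to_name, my_text)
print(result)
```

The fallback matters mostly on older Python versions, whose Unicode database may not yet assign names to every code point in the range.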

Upvotes: 4

A.H

Reputation: 1025

Not so clean, but it works:

import unicodedata

my_text ="\U0001F602  and all of this \U0001F605"

# range() excludes its end point, so add 1 to include U+1F64F itself.
for char in range(ord("\U0001F602"), ord("\U0001F64F") + 1):
    my_text = my_text.replace(chr(char), unicodedata.name(chr(char), "NOTHING"))

print(my_text)

Result: FACE WITH TEARS OF JOY  and all of this SMILING FACE WITH OPEN MOUTH AND COLD SWEAT
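A possible variant of the same approach builds a translation table once and applies it in a single pass with str.translate, instead of calling str.replace once per code point. This is only a sketch; the name table is an assumption, and "NOTHING" is kept as the placeholder for unnamed code points.

```python
import unicodedata

# Map each code point in the range to its Unicode name ("NOTHING" if unnamed).
# range() excludes its end point, hence the + 1.
table = {
    code: unicodedata.name(chr(code), "NOTHING")
    for code in range(0x1F602, 0x1F64F + 1)
}

my_text = "\U0001F602  and all of this \U0001F605"

# str.translate accepts a dict mapping code points to replacement strings.
result = my_text.translate(table)
print(result)
```

The table can be reused for any number of input strings, so the cost of the name lookups is paid only once.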

Upvotes: 0
