Reputation: 528
My input is a single string containing "literal" Unicode escape sequences:
"I want to replace \u00c6 with AE and \u00d5 with O"
Note: \u00c6 = Æ and \u00d5 = Õ
With my Python script, I can easily replace one character:
>>> print("I want to replace \u00c6 with AE and \u00d5 with O".replace(u"\u00c6","AE"))
I want to replace AE with AE and Õ with O
But what if I want to replace them all? (There are only 2 in the example, but imagine having to search for 50 characters to replace.)
I tried to use a dict to do the matching, but this does not seem to work:
#input : "\u00c0 \u00c1 \u00c2 \u00d2 \u00c4 \u00c5 \u00c6 \u00d6"
#output (expected) : "A A A O A A AE O"
import sys
unicode_table = {
    '\u00c0': 'A',  #À
    '\u00c1': 'A',  #Á
    '\u00c2': 'A',  #Â
    '\u00c3': 'A',  #Ã
    '\u00c4': 'A',  #Ä
    '\u00c5': 'A',  #Å
    '\u00c6': 'AE',  #Æ
    '\u00d2': 'O',  #Ò
    '\u00d3': 'O',  #Ó
    '\u00d4': 'O',  #Ô
    '\u00d5': 'O',  #Õ
    '\u00d6': 'O'  #Ö
    #this may go on much further
}
result = sys.argv[1]
for key in unicode_table:
    #print(key + unicode_table[key])
    result = result.replace(key, unicode_table[key])
print(result)
Output:
[puppet@damageinc python]$ python replace_unicode.py "\u00c0 \u00c1 \u00c2 \u00d2 \u00c4 \u00c5 \u00c6 \u00d6"
\u00c0 \u00c1 \u00c2 \u00d2 \u00c4 \u00c5 \u00c6 \u00d6
Any help appreciated ! Thanks.
Edit: two solutions from the comments, thanks.
1st: encode the string to bytes, then decode it with unicode_escape:
result = sys.argv[1].encode().decode('unicode_escape')
2nd: use the unidecode module, to avoid reinventing the wheel:
import sys
from unidecode import unidecode
result = sys.argv[1].encode().decode('unicode_escape')
print(unidecode(result))
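If a third-party dependency is not wanted, the first solution can also be combined with the lookup table from the script. The following is only a sketch: the helper names decode_escapes and transliterate are hypothetical, the table is a shortened copy of the one above, and it assumes the argument is pure ASCII (the escape sequences themselves always are). Non-ASCII input would make the encode step raise an error.

```python
# Shortened copy of the table from the question; extend as needed.
UNICODE_TABLE = {
    "\u00c0": "A",   # À
    "\u00c6": "AE",  # Æ
    "\u00d5": "O",   # Õ
    "\u00d6": "O",   # Ö
}

# Build the translation table once; keys must be single characters,
# values may be strings of any length.
TRANSLATION = str.maketrans(UNICODE_TABLE)


def decode_escapes(raw: str) -> str:
    """Turn literal backslash-u sequences into the characters they name.

    Assumes `raw` is pure ASCII; anything else raises UnicodeEncodeError.
    """
    return raw.encode("ascii").decode("unicode_escape")


def transliterate(text: str) -> str:
    """Replace every character found in UNICODE_TABLE in a single pass."""
    return text.translate(TRANSLATION)


# A raw string stands in for what the shell passes to the script.
raw = r"\u00c0 \u00c6 \u00d5 \u00d6"
print(transliterate(decode_escapes(raw)))  # prints: A AE O O
```

str.translate walks the string once, so it avoids calling replace once per table entry when the table grows to dozens of characters.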
Upvotes: 0
Views: 627
Reputation: 2407
Your Python code works as expected. It is your shell that doesn't render the escape sequences, so the Python script receives the six literal characters "\u00c0" instead of "À", and so on.
Try testing it with some actual Unicode strings, or tweak your command by adding e.g. printf or echo -e to render the escape sequences before passing them to the script:
python replace_unicode.py "$(printf '\u00c0 ... \u00d6')"
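The difference can be reproduced inside Python without involving the shell. This is a small sketch in which a raw string stands in for the unrendered argument; the variable names are only illustrative.

```python
# What the script receives from the shell: backslash, 'u', and four hex digits.
literal = r"\u00c0"
# What the dictionary key actually is: the single character À.
actual = "\u00c0"

print(len(literal), len(actual))     # prints: 6 1

# The key never occurs in the literal text, so replace() changes nothing.
print(literal.replace(actual, "A"))  # prints: \u00c0

# Once the argument holds the real character, the same call works.
print(actual.replace(actual, "A"))   # prints: A
```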
Upvotes: 1