Julien

Reputation: 528

Converting "literal" Unicode characters to an equivalent without accents

My input is a single string containing "literal" Unicode escape sequences:

"I want to replace \u00c6 with AE and \u00d5 with O"

Note: \u00c6 = Æ and \u00d5 = Õ

With my Python script, I can easily replace one character:

>>> print("I want to replace \u00c6 with AE and \u00d5 with O".replace(u"\u00c6","AE"))
I want to replace AE with AE and Õ with O

But what if I want to replace them all? (There are only 2 in the example, but imagine having to search for 50 characters to replace.)

I tried to use a dict to do the mapping, but this does not seem to work:

#input  : "\u00c0 \u00c1 \u00c2 \u00d2 \u00c4 \u00c5 \u00c6 \u00d6"
#output (expected) : "A A A O A A AE O"

import sys

unicode_table = {
   '\u00c0': 'A',  #À
   '\u00c1': 'A',  #Á
   '\u00c2': 'A',  #Â
   '\u00c3': 'A',  #Ã
   '\u00c4': 'A',  #Ä
   '\u00c5': 'A',  #Å
   '\u00c6': 'AE', #Æ
   '\u00d2': 'O',  #Ò
   '\u00d3': 'O',  #Ó
   '\u00d4': 'O',  #Ô
   '\u00d5': 'O',  #Õ
   '\u00d6': 'O'   #Ö
   #this may go on much further
}

result = sys.argv[1]

for key in unicode_table:
   #print(key + unicode_table[key])
   result = result.replace(key,unicode_table[key])

print(result)

Output:

[puppet@damageinc python]$ python replace_unicode.py "\u00c0 \u00c1 \u00c2 \u00d2 \u00c4 \u00c5 \u00c6 \u00d6"
\u00c0 \u00c1 \u00c2 \u00d2 \u00c4 \u00c5 \u00c6 \u00d6

Any help appreciated! Thanks.

Edit: two solutions from the comments, thanks.

1st: decode the escape sequences in the string with unicode_escape:

result = sys.argv[1].encode().decode('unicode_escape')

2nd: use the unidecode module, to avoid reinventing the wheel:

import sys
from unidecode import unidecode

result = sys.argv[1].encode().decode('unicode_escape')
print(unidecode(result))
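
One caveat with the unicode_escape route: that codec treats every byte that is not part of an escape as Latin-1, so an input that already contains real non-ASCII characters gets garbled. The sketch below is one possible workaround, not something from the original thread; decode_u_escapes is a hypothetical helper that only touches literal backslash-u sequences and leaves everything else alone.

```python
import re

def decode_u_escapes(text):
    # Hypothetical helper: turn each literal backslash-u escape
    # (backslash, 'u', four hex digits) into the character it names.
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

# A real é (U+00E9) mixed with a literal, unexpanded escape for Æ.
mixed = "caf\u00e9 " + r"\u00c6"

naive = mixed.encode().decode("unicode_escape")  # UTF-8 bytes read as Latin-1
safe = decode_u_escapes(mixed)                   # only the escape is expanded

print(naive)
print(safe)
```

If the input is guaranteed to be pure ASCII plus escapes, the shorter encode().decode('unicode_escape') form is enough.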

Upvotes: 0

Views: 627

Answers (1)

Czaporka

Reputation: 2407

Your Python code works as expected; it's your shell that doesn't interpret the escape sequences, so the Python script receives the literal six-character text "\u00c0" instead of "À", and so on.
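
To make the difference concrete, here is a small sketch (the variable names are only illustrative). A raw string stands in for what the shell hands to the script, and a normal string literal stands in for the real character.

```python
# What the script receives from the shell: backslash, 'u', and four hex digits.
literal = r"\u00c0"
# What the dict keys contain: the single character À.
real = "\u00c0"

print(len(literal), len(real))

# replace() finds nothing in the literal text, so it comes back unchanged.
unchanged = literal.replace("\u00c0", "A")
# The same call on the real character does the substitution.
replaced = real.replace("\u00c0", "A")

print(unchanged)
print(replaced)
```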

You should test it with actual Unicode strings, or tweak your command by using e.g. printf or echo -e to expand the escape sequences before passing them to the script:

python replace_unicode.py "$(printf '\u00c0 ... \u00d6')"
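
If changing the command line is not an option, the escapes can also be expanded inside the script before the lookup. The following is a minimal sketch under that assumption, using a shortened version of the question's table; it also swaps the loop of replace() calls for str.translate, which is my substitution and not part of the original script.

```python
# Shortened copy of the question's mapping; extend as needed.
unicode_table = {
    "\u00c0": "A",   # À
    "\u00c6": "AE",  # Æ
    "\u00d6": "O",   # Ö
}

# Stand-in for sys.argv[1] when the shell passes the escapes through untouched.
raw = r"\u00c0 \u00c6 \u00d6"

# Expand the literal escapes into real characters (ASCII-only input assumed).
decoded = raw.encode("ascii").decode("unicode_escape")

# str.maketrans accepts a dict with single-character keys and string values,
# so the whole table is applied in one pass.
result = decoded.translate(str.maketrans(unicode_table))

print(result)
```

The encode("ascii") call raises UnicodeEncodeError if the argument already contains real non-ASCII characters, which makes the ASCII-only assumption explicit instead of silently garbling the text.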

Upvotes: 1
