Ervis Ilikeyoutoo

Reputation: 3

Python - Unicode

A simple script is not behaving as expected.

notAllowed = {"â":"a", "à":"a", "é":"e", "è":"e", "ê":"e",
              "î":"i", "ô":"o", "ç":"c", "û":"u"}

word = "dôzerté"
print word

for char in word:
    if char in notAllowed.keys():
        print "hooray"
        word = word.replace(char, notAllowed[char])


print word
print "finished"

The output shows the word unchanged, even though the loop should have replaced "ô" and "é" with "o" and "e", giving dozerte.

Any ideas?

Upvotes: 0

Views: 1158

Answers (2)

Simon

Reputation: 12478

In Python 2, iterating a plain str iterates its bytes, not necessarily its characters. If your Python source file is encoded as UTF-8, len(word) will be 9 instead of 7, because each of the two accented characters takes two bytes. None of those single bytes matches a key in notAllowed, so nothing gets replaced. Iterating a unicode string (u"dôzerté") iterates characters, so that should work.
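To make the byte/character difference concrete, here is a small sketch. It assumes UTF-8 and builds the byte form explicitly with encode(), so it behaves the same on Python 2.7 and Python 3; the variable names are only for illustration.

```python
# -*- coding: utf-8 -*-
# Text form: a unicode string, one item per character.
word = u"dôzerté"
# Byte form: what a plain Python 2 str holds when the source file is UTF-8.
encoded = word.encode("utf-8")

char_count = len(word)     # counts characters
byte_count = len(encoded)  # counts bytes; "ô" and "é" take two bytes each

print(char_count)  # 7
print(byte_count)  # 9
```

The loop in the question walks over the byte form, so it never sees "ô" or "é" as a single item.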

May I also suggest you use unidecode for the task you're trying to achieve?
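If installing a third-party package is not an option, a rough standard-library approximation is possible. Note that this is not unidecode and not how unidecode works internally: it is a sketch that decomposes each character with unicodedata.normalize and drops the combining marks. The helper name strip_accents is made up for this example, and characters without a decomposition (such as "ø") pass through unchanged.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Return text with combining accents removed (hypothetical helper)."""
    # NFD splits "ô" into "o" followed by a combining circumflex, and so on.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for combining marks, so this keeps base letters only.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents(u"dôzerté"))  # dozerte
```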

Upvotes: 2

kgr

Reputation: 9948

How about:

# -*- coding: utf-8 -*-
notAllowed = {u"â":u"a", u"à":u"a", u"é":u"e", u"è":u"e", u"ê":u"e",
          u"î":u"i", u"ô":u"o", u"ç":u"c", u"û":u"u"}

word = u"dôzerté"
print word

for char in word:
    # word is unicode, so each char is a whole character, not a single byte
    if char in notAllowed:
        print "hooray"
        word = word.replace(char, notAllowed[char])


print word
print "finished"

Basically, if you want to assign a unicode string to a variable, you need to write:

u"..." 
#instead of just
"..."

to mark the literal as a unicode string rather than a byte string.
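Once everything is unicode, the loop can also be collapsed into a single translate() call. This is only a sketch of an alternative, not something the code above needs; the names not_allowed, table and cleaned are illustrative, and the mapping is the one from the question. It should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
not_allowed = {u"â": u"a", u"à": u"a", u"é": u"e", u"è": u"e", u"ê": u"e",
               u"î": u"i", u"ô": u"o", u"ç": u"c", u"û": u"u"}

# translate() on a unicode string expects a mapping keyed by code point,
# so convert each one-character key with ord().
table = dict((ord(key), value) for key, value in not_allowed.items())

word = u"dôzerté"
cleaned = word.translate(table)
print(cleaned)  # dozerte
```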

Upvotes: 2
