Python convert unicode to ASCII

Question

I have a list of strings containing various characters that resemble Latin ones. I get them from a website that I download using urllib2; the website is encoded in UTF-8. However, after trying quite a few variations, I can't figure out how to convert them to their simple ASCII equivalents. For example, one of the strings I have is:

u'Atl\xc3\xa9tico Madrid'

In plain text it's "Atlético Madrid"; what I want is to change it to just "Atletico Madrid". If I simply run unidecode on it, I get "AtlA(c)tico Madrid". What am I doing wrong?

Martijn Pieters · Accepted Answer

You have UTF-8 bytes in a Unicode string. That is not a proper Unicode string; it is Mojibake:

>>> print u'Atl\xc3\xa9tico Madrid'
AtlÃ©tico Madrid

Repair your string first:

>>> u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
u'Atl\xe9tico Madrid'
>>> print u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
Atlético Madrid

and Unidecode will give you what you expected:

>>> import unidecode
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid')
'AtlA(c)tico Madrid'
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8'))
'Atletico Madrid'
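The two steps can be wrapped in small helpers. The following is a minimal sketch in Python 3 syntax, not part of the original session: the names `repair_mojibake` and `to_ascii` are hypothetical, and `to_ascii` swaps in the standard library's `unicodedata` for Unidecode, which only strips accents and is cruder than Unidecode's transliteration.

```python
import unicodedata


def repair_mojibake(text):
    """Undo a Latin-1 misdecoding of UTF-8 bytes (hypothetical helper).

    Returns the text unchanged when the repair does not apply.
    """
    try:
        return text.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the resulting
        # bytes are not valid UTF-8, so assume it was not mojibake.
        return text


def to_ascii(text):
    """Strip accents with the standard library only (hypothetical helper)."""
    # NFKD splits 'é' into 'e' plus a combining accent; the accent is then
    # dropped because it cannot be encoded as ASCII.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii(repair_mojibake('Atl\xc3\xa9tico Madrid')))  # Atletico Madrid
```

The `try`/`except` means an already-correct string such as `'Atl\xe9tico Madrid'` passes through untouched, because its Latin-1 bytes are not valid UTF-8. A string that legitimately contains a sequence like "Ã©" would still be altered, so treat the repair as a heuristic.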

Better still would be to read your data correctly in the first place; you appear to have decoded the data as Latin-1 (or perhaps the Windows CP-1252 codepage) rather than as UTF-8.
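To illustrate the difference, here is a short sketch in Python 3 syntax. The variable `raw` is a stand-in for the undecoded bytes a download returns (in Python 2, what `urllib2.urlopen(...).read()` gives you); no real request is made, and the variable names are assumptions for the example.

```python
# Stand-in for the raw, undecoded bytes of the downloaded page.
raw = 'Atl\xe9tico Madrid'.encode('utf-8')

# Decode with the encoding the site actually uses.
correct = raw.decode('utf-8')

# Decoding the same bytes as Latin-1 reproduces the string from the question.
mistaken = raw.decode('latin-1')

print(correct)
print(mistaken)
```

Here `correct` equals `'Atl\xe9tico Madrid'`, while `mistaken` equals `'Atl\xc3\xa9tico Madrid'`, the Mojibake shown at the top. Decoding once, with the right codec, at the point where the bytes enter your program avoids the repair step entirely.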
