Reputation: 993
I have a list of strings with various different characters that are similar to latin ones, I get these from a website that I download from using urllib2. The website is encoded in utf-8. However, after trying quite a few variations, I can't figure out how to convert this to simple ASCII equivalent. So for example, one of the strings I have is:
u'Atl\xc3\xa9tico Madrid'
In plain text it's "Atlético Madrid", what I want, is to change it to just "Atletico Madrid". If I use simple unidecode on this, I get "AtlA(c)tico Madrid". What am I doing wrong?
Upvotes: 0
Views: 1551
Reputation: 1125058
You have UTF-8 bytes in a Unicode string. That's not a proper Unicode string, that's a Mojibake:
>>> print u'Atl\xc3\xa9tico Madrid'
Atlético Madrid
Repair your string first:
>>> u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
u'Atl\xe9tico Madrid'
>>> print u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
Atlético Madrid
and Unidecode will give you what you expected:
>>> import unidecode
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid')
'AtlA(c)tico Madrid'
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8'))
'Atletico Madrid'
Better still would be to read your data correctly in the first place; you appear to have decoded the data as Latin-1 (or perhaps the Windows CP-1252 codepage) rather than as UTF-8.
Upvotes: 9