Reputation: 13
I am trying to analyse place names from a dataset that mixes different encodings. It contains place names, among much other geospatial data. After running the script, I get a list of place names matching geo-locations, e.g. [u'BR', u'BR-ES', u'BRA', u'Brasil', u'Brazil', u'ES', u'Espirito Santo', u'Esp\xedrito Santo', u'Federative Republic of Brazil', u'Guarapari', u'Rep\xfablica Federativa do Brasil', u'gpxupload.py']. So far, so good. But sometimes the dataset gives me results such as u'Taubat\u0102\u0160', which the analysis renders as TaubatĂŠ instead of the correct value Taubaté, whereas the previous example correctly produces Espírito Santo and República Federativa do Brasil.
Is there a way to capture \u0102\u0160 and convert it to \xe9 without having to create individual .replace() rules for each letter?
Upvotes: 0
Views: 193
Reputation: 177755
u'Taubat\u0102\u0160'
was decoded with the wrong codec. The underlying bytes were actually UTF-8 but were decoded as 'iso-8859-2'. Ideally, decode the data correctly in the first place, but the following round trip backs out the error:
>>> u'Taubat\u0102\u0160'.encode('iso-8859-2').decode('utf8')
u'Taubat\xe9'
>>> print(u'Taubat\u0102\u0160'.encode('iso-8859-2').decode('utf8'))
Taubaté
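If the whole list of already-decoded strings needs the same repair, a small helper can attempt the round trip and fall back to the original value when it fails. This is a minimal sketch, not code from the question's script; fix_mojibake and the sample list are illustrative names:

def fix_mojibake(text):
    # Re-encode the mis-decoded string and decode the bytes as UTF-8.
    # Correctly decoded strings usually fail one of the two steps
    # (characters outside iso-8859-2, or bytes that are not valid
    # UTF-8) and are returned unchanged. This is a heuristic: in rare
    # cases a correct string could also round-trip as valid UTF-8
    # and be altered.
    try:
        return text.encode('iso-8859-2').decode('utf8')
    except UnicodeError:  # parent of both encode and decode errors
        return text

names = [u'Taubat\u0102\u0160', u'Esp\xedrito Santo', u'Guarapari']
print([fix_mojibake(n) for n in names])
# [u'Taubat\xe9', u'Esp\xedrito Santo', u'Guarapari']

As an aside, the third-party ftfy library is built to repair this kind of mojibake automatically; whether it detects this particular UTF-8/iso-8859-2 mix-up is worth verifying against your data.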
Upvotes: 1