Aun Johnsen

Reputation: 13

Analysing a dataset with Unicode characters in mixed encodings gives strange results

I am trying to analyse names from a dataset that uses different (mixed) encodings. It contains place names, among much other geospatial data. After running the script, I get a list of place names matching geo-locations, e.g. [u'BR', u'BR-ES', u'BRA', u'Brasil', u'Brazil', u'ES', u'Espirito Santo', u'Esp\xedrito Santo', u'Federative Republic of Brazil', u'Guarapari', u'Rep\xfablica Federativa do Brasil', u'gpxupload.py']. So far so good. But sometimes the dataset gives me results such as u'Taubat\u0102\u0160', which the analysis treats as TaubatĂŠ instead of the correct value Taubaté, whereas the previous example correctly produces Espírito Santo and República Federativa do Brasil.

Is there a way to capture \u0102\u0160 and convert it to \xe9 without having to create individual .replace() rules for each letter?

Upvotes: 0

Views: 193

Answers (1)

Mark Tolonen

Reputation: 177755

u'Taubat\u0102\u0160' was decoded with the wrong codec: the text was actually UTF-8, but it was decoded as 'iso-8859-2'. Ideally, decode it correctly in the first place, but the following backs out the error:

>>> u'Taubat\u0102\u0160'.encode('iso-8859-2').decode('utf8')
u'Taubat\xe9'
>>> print(u'Taubat\u0102\u0160'.encode('iso-8859-2').decode('utf8'))
Taubaté
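
If many names in the dataset are affected, you could apply the same round trip to the whole list and fall back to the original string when the round trip fails (i.e. the name was already decoded correctly). This is a minimal sketch, not part of the original answer; fix_mojibake and the sample list are illustrative, and the heuristic assumes the mis-decoded strings really are UTF-8 read as iso-8859-2:

def fix_mojibake(name):
    # Reverse a UTF-8-read-as-iso-8859-2 mis-decode; if the round trip
    # raises, the string was most likely decoded correctly, so keep it.
    try:
        return name.encode('iso-8859-2').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name

names = [u'Taubat\u0102\u0160', u'Esp\xedrito Santo', u'Guarapari']
print([fix_mojibake(n) for n in names])
# [u'Taubat\xe9', u'Esp\xedrito Santo', u'Guarapari']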

Upvotes: 1
