EagleOne

Reputation: 238

python: converting from unknown char encoding (assume iso-8859-1) to unicode

I'm decoding an xml file with xml.etree and one of the elements contains this string:

ExÃ©cutive

I tried pretty much everything to figure out how to transform it to its real value:

Exécutive

I tried the following:

>>> s = 'Ã©'

>>> s
'\xc3\x83\xc2\xa9'

>>> print s
Ã©

>>> type(s)
<type 'str'>

>>> s.decode('iso-8859-1')
u'\xc3\x83\xc2\xa9'

>>> print( s.decode('iso-8859-1').encode('utf-8'))
Ã©

>>> print( s.decode('utf-8'))
Ã©

I'm kind of lost here with these encodings. Anyone for a little help?

Thanks in advance

Upvotes: 2

Views: 3233

Answers (1)

Jukka K. Korpela

Reputation: 201618

The data is apparently UTF-8 encoded data (e.g., “é” is two bytes) misinterpreted as ISO-8859-1. For the test case, the following produces the output “Exécutive”:

# This Python file uses the following encoding: utf-8
s = 'ExÃ©cutive'
print s.decode('utf-8')
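
Where the text has already been decoded the wrong way, one option is to reverse the misinterpretation: re-encode the string as ISO-8859-1 to recover the original bytes, then decode those bytes as UTF-8. Below is a minimal sketch in Python 3 syntax (the question uses Python 2). The helper name fix_mojibake is made up for illustration, and it assumes exactly one layer of mis-decoding:

```python
def fix_mojibake(text):
    """Undo one layer of 'UTF-8 bytes read as ISO-8859-1' (hypothetical helper)."""
    try:
        # Encoding as ISO-8859-1 maps each character back to the byte it came
        # from; decoding as UTF-8 then interprets those bytes correctly.
        return text.encode('iso-8859-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not garbled in this particular way; leave it unchanged.
        return text


# 'Ex' + U+00C3 + U+00A9 + 'cutive' is the garbled form from the question.
garbled = u'Ex\u00c3\u00a9cutive'
fixed = fix_mojibake(garbled)
```

Strings that are already correct normally fail the round trip and come back unchanged, so the helper is reasonably safe to apply to mixed data, though it is a repair for a symptom rather than a substitute for reading the file with the right encoding.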

In processing the XML file, you probably just need to read it as UTF-8.
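
As a rough illustration of letting the parser do the decoding, the sketch below (Python 3 syntax, with a made-up sample document) passes raw bytes to xml.etree.ElementTree.fromstring, which takes the encoding from the XML declaration. Whether the asker's file carries a correct declaration is an assumption:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, encoded as UTF-8 bytes the way a file on disk would be.
xml_bytes = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<job><title>Ex\u00e9cutive</title></job>'
).encode('utf-8')

# Hand the parser bytes, not an already-decoded string, so that it applies
# the encoding named in the declaration itself.
root = ET.fromstring(xml_bytes)
title = root.find('title').text
```

For a file on disk, ET.parse(path) reads the bytes itself in the same way; decoding the file manually with the wrong codec before parsing is the kind of step that produces strings like the one in the question.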

Upvotes: 2
