Reputation: 6832
I know there are many questions out there already concerning encoding/decoding. But this is driving me nuts and I'm in desperate need of some help.
I read in a file, converting each line to unicode:
line = unicode(line,'latin-1')
Then I make some changes and try to write the contents back to a file, encoding the string like this:
o_str = '%s,%s' % (new_sname, loc )
w_out.write(o_str.encode('latin-1'))
The file contains, for instance, the city name 'Genève', which is u'Gen\xc3\xa8ve' as unicode. Encoding it as Latin-1
gue = gu.encode('iso-8859-1')
gives me on the console
>>> print gue
Genève
But in my file it is still 'GenÃ¨ve'. Can somebody point me to what I am missing?
Upvotes: 0
Views: 6374
Reputation: 1124060
You are decoding UTF-8 data as Latin-1; use the correct codec instead:
>>> 'Gen\xc3\xa8ve'.decode('latin1')
u'Gen\xc3\xa8ve'
>>> print 'Gen\xc3\xa8ve'.decode('latin1')
GenÃ¨ve
>>> 'Gen\xc3\xa8ve'.decode('utf8')
u'Gen\xe8ve'
>>> print 'Gen\xc3\xa8ve'.decode('utf8')
Genève
The correct Unicode codepoint for the letter è is U+00E8, represented by \u00e8 or \xe8 in a Python Unicode literal, and by the hex bytes C3 A8 in UTF-8. Misinterpreting C3 A8 as Latin-1 yields two Unicode characters, Ã and ¨, which you then write back to your file as C3 and A8 again because Latin-1 maps one-to-one onto the first 256 Unicode codepoints.
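To make the round trip concrete, here is a minimal sketch of both the mistake and the fix. It assumes the input file really is UTF-8, as the bytes in your example suggest. The helper name convert_file and its arguments are hypothetical, not something from your code. It uses byte literals and io.open so it should behave the same on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import io

raw = b'Gen\xc3\xa8ve'  # the UTF-8 bytes as they sit in the input file

# Wrong codec: every byte becomes its own codepoint (U+00C3 and U+00A8),
# and encoding back to Latin-1 just reproduces the original bytes.
wrong = raw.decode('latin-1')
roundtrip = wrong.encode('latin-1')

# Right codec: C3 A8 decodes to the single codepoint U+00E8,
# which Latin-1 encodes as the single byte E8.
right = raw.decode('utf-8')
latin1_bytes = right.encode('latin-1')


def convert_file(src_path, dst_path):
    """Hypothetical helper: read src_path as UTF-8, write dst_path as Latin-1."""
    # newline='' leaves the line endings untouched on both sides.
    with io.open(src_path, 'r', encoding='utf-8', newline='') as src:
        with io.open(dst_path, 'w', encoding='latin-1', newline='') as dst:
            for line in src:
                # line is already a Unicode string; io encodes it on write.
                dst.write(line)
```

Note that encoding to Latin-1 raises UnicodeEncodeError for any character outside its 256-codepoint range, so this only works if the data fits in Latin-1.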
Upvotes: 3