Python Unicode behaviour

Question

I have a word êtes in two files and I tried converting it into different formats.

1) I opened the file with codecs.open('test1.txt',encoding='ISO-8859-2') and then did word.encode('utf-8'). The word read as \xc4\x99tes

2) I opened another file with the same word, but with codecs.open('test2.txt',encoding='utf-8'). This time the word read as \xeates

Shouldn't both give the same output?

Mark Ransom · Accepted Answer

No, they should not give the same output. The first is the result of word.encode('utf-8'), so it is a byte string; the second is the decoded text returned by codecs.open, so it is a Unicode string.

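It appears your first file is encoded with ISO-8859-1, not ISO-8859-2. In ISO-8859-1 the byte \xea is ê, but ISO-8859-2 maps that same byte to ę (\u0119), so decoding with the wrong codec gives you ę instead. Its UTF-8 representation is the two bytes \xc4\x99, which is what you see after encoding.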
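
The mismatch can be reproduced without any files. This is a minimal sketch, assuming the first file really holds the single byte \xea for ê (as an ISO-8859-1 file would); the variable names are illustrative only.

```python
# Assumed raw content of test1.txt: "êtes" saved as ISO-8859-1.
raw = b'\xeates'

# Same byte, two different code pages, two different characters.
as_latin1 = raw.decode('iso-8859-1')   # ê is U+00EA
as_latin2 = raw.decode('iso-8859-2')   # ę is U+0119

# Re-encoding the mis-decoded text gives the bytes from the question.
wrong_utf8 = as_latin2.encode('utf-8')   # b'\xc4\x99tes'
right_utf8 = as_latin1.encode('utf-8')   # b'\xc3\xaates'

print(repr(wrong_utf8))
print(repr(right_utf8))
```

If that assumption holds, opening test1.txt with encoding='ISO-8859-1' should give the same Unicode string as the second file.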

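The second file appears to be properly encoded in UTF-8. The \xea you see there is not four separate characters; it is just how Python displays the single character ê in the string's escaped representation. If you want to see the actual character rather than its hex representation, you need to print it.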
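
To illustrate the difference between the escaped form and the printed character, here is a small sketch. It assumes Python 3, where ascii() gives the kind of escaped representation that repr() produces in Python 2.

```python
word = u'\xeates'        # the Unicode string read from test2.txt

escaped = ascii(word)    # escaped representation, quotes included
print(escaped)           # '\xeates'
print(len(word))         # 4: \xea is one character, not four
print(word)              # shows the actual text if the terminal can display it
```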
