Nadine

Reputation: 797

Why does str.encode('utf-8') produce UnicodeDecodeError in my python script?

When running the following code (which just prints out file names):

print filename

It throws the following error:

File "myscript.py", line 78, in __listfilenames
print filename
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)

To fix this, I tried changing print filename to print filename.encode('utf-8'), but that didn't fix the problem.

The script only fails when trying to read a filename such as Coé.jpg.

Any ideas how I can modify filename so the script continues to work when it comes across a special character?

NB. I'm a python noob

Upvotes: 1

Views: 1591

Answers (1)

Martijn Pieters

Reputation: 1124848

filename is already encoded. It is already a byte string and doesn't need encoding again.

But since you asked it to be encoded, Python first has to decode it for you, and it can only do that with the default ASCII encoding. That implicit decoding fails:

>>> 'Coé.jpg'
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.decode('utf8')
u'Co\xe9.jpg'
>>> 'Coé.jpg'.decode('utf8').encode('utf8')
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

If you wanted encoded bytestrings, you don't have to do any encoding at all. Remove the .encode('utf8').
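To make the implicit step visible, here is a minimal sketch that spells it out by hand. It assumes the filename arrives as UTF-8 bytes; the names `raw`, `failed`, `message`, and `text` are illustrative only and not from the original script. In Python 2, calling `.encode('utf8')` on a byte string behaves roughly like `.decode('ascii').encode('utf8')`, and the ASCII decode is the part that raises:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch; `raw` stands in for the byte-string filename.
raw = b'Co\xc3\xa9.jpg'  # UTF-8 bytes for "Coé.jpg"

# The hidden step: decoding with the default ASCII codec.
# Byte 0xc3 is outside the ASCII range, so this raises UnicodeDecodeError.
try:
    raw.decode('ascii')
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode('utf-8')
```

The explicit `decode('utf-8')` is only correct if the bytes really are UTF-8, which is an assumption about the filesystem here.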

You probably need to read up on Python and Unicode.

The rule of thumb is: decode as early as you can, encode as late as you can. That means when you receive data, decode it to Unicode objects; when you need to pass that information to something else, encode it only then. Many APIs can do the decoding and encoding as part of their job; print will encode to the codec used by the terminal, for example.
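As a rough sketch of that rule, assuming the filenames are UTF-8 encoded byte strings: the helpers `to_text` and `write_names` below are hypothetical names made up for illustration, not part of any library or of the original script. Bytes are decoded as soon as they enter the program and are only encoded again at the point of output.

```python
# -*- coding: utf-8 -*-
import io


def to_text(name, encoding='utf-8'):
    """Decode early: turn a byte-string filename into Unicode text."""
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name  # already Unicode text, nothing to do


def write_names(names, stream, encoding='utf-8'):
    """Encode late: convert to bytes only at the output boundary."""
    for name in names:
        text = to_text(name)  # work with Unicode inside the program
        stream.write(text.encode(encoding) + b'\n')


# Sample input is an assumption: UTF-8 byte strings, as a directory
# listing might return them on a UTF-8 filesystem.
raw_names = [b'Co\xc3\xa9.jpg', b'plain.txt']
out = io.BytesIO()
write_names(raw_names, out)
```

Keeping the conversion in one small helper means the rest of the code never has to guess whether a value is bytes or text.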

Upvotes: 1
