anon_swe

Reputation: 9355

Python: UnicodeDecodeError with Default Encoding of ASCII

I'm doing some text processing in Python 2.7, where the default encoding is ASCII. I get a UnicodeDecodeError when I try to encode some of my strings to UTF-8. Specifically, for each word in my document, I do this:

word = word.encode('utf-8')

This works when all of my characters are ASCII, but when they're not, I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 5: ordinal not in range(128)

I'm confused, since I thought calling encode would turn everything from ASCII into utf-8. Since utf-8 is a superset of ASCII, I shouldn't get any issues...but I do.

Also, I'm not sure why it says that ASCII can't decode when I would expect it to say that ASCII can't encode my word into utf-8.

Any help would be awesome!

Upvotes: 0

Views: 389

Answers (1)

Mark Tolonen

Reputation: 178389

You encode Unicode strings to byte strings, and decode byte strings to Unicode strings. So to encode to a UTF-8 byte string, start with a Unicode string. If you start with a byte string, Python 2.7 first implicitly decodes it to Unicode using the default ASCII codec, and only then encodes the result. If your byte string contains non-ASCII bytes, that implicit decode fails, which is why you get a UnicodeDecodeError even though you called .encode().

Python 3 removes the implicit decode to Unicode when you start with a byte string. In fact, .encode() is not available on byte strings (bytes) and .decode() is not available on Unicode strings (str). Python 3 also changes the default encoding to UTF-8.
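To make the hidden step visible, here is a minimal sketch that spells out what Python 2.7 roughly does when .encode() is called on a byte string. It is written so it also runs on Python 3. The helper name py2_style_encode and the sample bytes (assumed here to be Latin-1 for "café") are illustrative assumptions, not part of the question.

```python
# Sample input: the Latin-1 bytes for "café". The source encoding is an
# assumption made for this illustration.
raw = b'caf\xe9'


def py2_style_encode(data, target='utf-8'):
    """Hypothetical helper: mimic Python 2.7's byte-string .encode().

    Python 2.7 first decodes the bytes with the default codec (ASCII),
    then encodes the resulting Unicode string to the target encoding.
    """
    return data.decode('ascii').encode(target)


# Pure-ASCII bytes survive the implicit decode, so nothing goes wrong.
ascii_only = py2_style_encode(b'cafe')

# A non-ASCII byte fails in the *decode* step, before any encoding happens.
error_message = None
try:
    py2_style_encode(raw)
except UnicodeDecodeError as exc:
    error_message = str(exc)
    print(error_message)
```

The failure comes from the .decode('ascii') half of the chain, which matches the "'ascii' codec can't decode" wording in the traceback.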

Examples:

Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café'.encode('utf8')  # Started with a byte string
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 3: ordinal not in range(128)
>>> u'café'.encode('utf8')  # Started with Unicode string
'caf\xc3\xa9'

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'café'.encode()  # Starting with a Unicode string, default UTF-8.
b'caf\xc3\xa9'
>>> 'café'.decode()  # You can only *encode* Unicode strings.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
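One possible way to apply this to the loop in the question is to decode explicitly with the encoding the input was actually written in, and only then encode to UTF-8. The sketch below is a hedged illustration: the helper name to_utf8 and the 'latin-1' default are assumptions, and the real source encoding of the document has to be known or determined separately.

```python
def to_utf8(word, source_encoding='latin-1'):
    """Hypothetical helper: return UTF-8 bytes for a word.

    Accepts either a byte string or a Unicode string. Byte strings are
    decoded explicitly with source_encoding (an assumption here) instead
    of relying on an implicit ASCII decode.
    """
    if isinstance(word, bytes):
        word = word.decode(source_encoding)
    return word.encode('utf-8')


# Byte-string input: decoded as Latin-1 first, then encoded as UTF-8.
from_bytes = to_utf8(b'caf\xe9')

# Unicode input: encoded directly.
from_text = to_utf8(u'caf\xe9')

print(from_bytes == from_text)
```

Both calls produce the same UTF-8 bytes as the u'café'.encode('utf8') example above. If the wrong source_encoding is passed, the decode may either raise or silently produce the wrong characters, so that parameter should match the real input.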

Further reading: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Upvotes: 2
