sanyassh

Reputation: 8500

Python2: what do str.encode() and unicode.decode() do?

From the question "Python str vs unicode types" and its answers, I understood that unicode.encode() gives you str and str.decode() gives you unicode:

a = 'à'
ua = u'à'
print type(a)  # str
print type(ua)  # unicode
print ua.encode('utf-8') == a  # True
print a.decode('utf-8') == ua  # True

But I don't understand the purpose of the unicode.decode() and str.encode() methods. What are they supposed to return? How can I use them? Both of the following lines fail, with UnicodeEncodeError and UnicodeDecodeError respectively:

print ua.decode('utf-8')  # UnicodeEncodeError
print a.encode('utf-8')  # UnicodeDecodeError

Upvotes: 1

Views: 1461

Answers (1)

chepner

Reputation: 530922

TL;DR Using unicode.decode and str.encode means you aren't using the right types to represent your data. The corresponding methods don't even exist on the equivalent types in Python 3.


A unicode value is a sequence of Unicode code points, each of which is an integer interpreted as a particular character. A str, on the other hand, is a sequence of bytes.

For example, à is Unicode code point U+00E0. The UTF-8 encoding represents it with a pair of bytes, 0xC3 and 0xA0.
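
These numbers are easy to check. The following is a minimal sketch written so that it runs unchanged on Python 2 and Python 3; the variable names are illustrative only.

```python
# Text: one code point, written with an escape so the source stays ASCII.
text = u'\u00e0'

code_point = ord(text)                   # integer value of the code point
encoded = text.encode('utf-8')           # byte string (str on Python 2, bytes on Python 3)
byte_values = list(bytearray(encoded))   # integer value of each byte

print(hex(code_point))                   # 0xe0
print([hex(b) for b in byte_values])     # ['0xc3', '0xa0']
```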

The unicode.encode method takes a Unicode string (a sequence of code points) and returns the byte-level encoding of each code point as a single byte string.

>>> ua.encode('utf-8')
'\xc3\xa0'

str.decode takes a byte string and attempts to produce the equivalent Unicode value.

>>> '\xc3\xa0'.decode('utf-8')
u'\xe0'

(u'\xe0' is equivalent to u'à').


As for your errors: Python 2 doesn't enforce a strict separation between how unicode and str are used. It doesn't really make sense to encode a str if it is already an encoded value, and it doesn't make sense to decode a unicode value because it's not encoded in the first place. Rather than pick apart exactly how the errors occur, I'll just point out that in Python 3, there are two types: bytes is a string of bytes (corresponding to Python 2 str), and str is a Unicode string (corresponding to Python 2 unicode). The "nonsensical" methods don't even exist in Python 3:

>>> bytes.encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'bytes' has no attribute 'encode'
>>> str.decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'str' has no attribute 'decode'

So the calls that raised Unicode*Error exceptions in Python 2 would simply raise an AttributeError in Python 3.
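
For the curious, the Python 2 errors come from an implicit conversion through the default encoding (ASCII unless it has been reconfigured): str.encode first decodes the byte string to unicode, and unicode.decode first encodes the text to a byte string. The sketch below imitates that round trip explicitly. The helper names py2_style_str_encode and py2_style_unicode_decode are hypothetical, the sample bytes assume the original data was UTF-8, and the code approximates the behaviour rather than reproducing the interpreter's implementation. It runs on both Python 2 and Python 3.

```python
def py2_style_str_encode(raw, encoding):
    # Python 2 str.encode: implicit ASCII decode first, then the requested encode.
    return raw.decode('ascii').encode(encoding)


def py2_style_unicode_decode(text, encoding):
    # Python 2 unicode.decode: implicit ASCII encode first, then the requested decode.
    return text.encode('ascii').decode(encoding)


a = b'\xc3\xa0'   # the UTF-8 bytes of the character (assumed source encoding)
ua = u'\xe0'      # the character itself

try:
    py2_style_str_encode(a, 'utf-8')
except UnicodeDecodeError as exc:
    print('str.encode failed: %s' % type(exc).__name__)

try:
    py2_style_unicode_decode(ua, 'utf-8')
except UnicodeEncodeError as exc:
    print('unicode.decode failed: %s' % type(exc).__name__)

# Pure-ASCII data survives the implicit step, which is why the bug often hides.
print(py2_style_str_encode(b'abc', 'utf-8') == b'abc')
```

The implicit step fails on the first non-ASCII byte or character, which explains why the exception type looks "backwards" compared with the method that was called.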

If you are stuck supporting Python 2, just follow these rules:

  • unicode is for text
  • str is for binary data
  • unicode.encode produces a str value
  • str.decode produces a unicode value
  • If you find yourself trying to call str.encode, you are using the wrong type.
  • If you find yourself trying to call unicode.decode, you are using the wrong type.
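
One way to apply these rules is to decode bytes as soon as they arrive, work only with text in between, and encode again just before output. A minimal sketch, assuming UTF-8 input; the function name shout and the sample data are made up for illustration, and the code runs on both Python 2 and Python 3.

```python
def shout(text):
    # Operates on text (unicode on Python 2, str on Python 3), never on bytes.
    return text.upper()


raw_in = b'\xc3\xa0 la carte'        # binary data, e.g. read from a file or socket
text = raw_in.decode('utf-8')        # bytes -> text at the input boundary
result = shout(text)                 # text-only processing
raw_out = result.encode('utf-8')     # text -> bytes at the output boundary

print(repr(raw_out))
```

Because every conversion names its encoding explicitly, the implicit ASCII step never gets a chance to run.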

Upvotes: 3
