Reputation: 8500
From the question "Python str vs unicode types" and its answers, I understood that unicode.encode() gives you str and str.decode() gives you unicode:
a = 'à'
ua = u'à'
print type(a) # str
print type(ua) # unicode
print ua.encode('utf-8') == a # True
print a.decode('utf-8') == ua # True
But I don't understand the purpose of the unicode.decode() and str.encode() methods. What are they supposed to return? How can I use them? Both of the following lines fail, with UnicodeEncodeError and UnicodeDecodeError respectively:
print ua.decode('utf-8')
print a.encode('utf-8')
Upvotes: 1
Views: 1461
Reputation: 530922
TL;DR: Using unicode.decode and str.encode means you aren't using the right types to represent your data. The methods on the equivalent types in Python 3 don't even exist.
A unicode value is a sequence of Unicode code points, each an integer interpreted as a particular character. A str, on the other hand, is a sequence of bytes.
For example, à is Unicode code point U+00E0. The UTF-8 encoding represents it with a pair of bytes, 0xC3 and 0xA0.
The unicode.encode method takes a Unicode string (a sequence of code points) and returns the byte-level encoding of each code point, concatenated into a single byte string.
>>> ua.encode('utf-8')
'\xc3\xa0'
str.decode takes a byte string and attempts to produce the equivalent Unicode value.
>>> '\xc3\xa0'.decode('utf-8')
u'\xe0'
(u'\xe0' is equivalent to u'à'.)
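For comparison, here is a minimal sketch of the same round trip in Python 3, where the text type is str and the byte type is bytes. The variable names are illustrative only and do not come from the question.

```python
# Python 3: str holds code points, bytes holds encoded data.
text = "\u00e0"                     # the character à, written as an escape

code_points = [ord(ch) for ch in text]
encoded = text.encode("utf-8")      # str -> bytes
byte_values = list(encoded)         # the individual byte values
decoded = encoded.decode("utf-8")   # bytes -> str

print(code_points)      # [224], i.e. U+00E0
print(encoded)          # b'\xc3\xa0'
print(byte_values)      # [195, 160], i.e. 0xC3 and 0xA0
print(decoded == text)  # True
```

One code point goes in, two bytes come out, and decoding those two bytes gives back the original one-character string.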
As for your errors: Python 2 doesn't enforce a strict separation between how unicode and str are used. It doesn't really make sense to encode a str, which is already an encoded value, or to decode a unicode value, which is not encoded in the first place. Rather than pick apart exactly how the errors occur, I'll just point out that Python 3 has two types: bytes is a string of bytes (corresponding to Python 2 str), and str is a Unicode string (corresponding to Python 2 unicode). The "nonsensical" methods don't even exist in Python 3:
>>> bytes.encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'bytes' has no attribute 'encode'
>>> str.decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'str' has no attribute 'decode'
So your attempts that raised Unicode*Error exceptions in Python 2 would simply raise an AttributeError in Python 3.
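For readers who do want to see where the Python 2 errors came from: before doing what was asked, Python 2 first converted the value to the other type using the default codec (ASCII unless reconfigured). So a.encode('utf-8') behaved roughly like a.decode('ascii').encode('utf-8'), and ua.decode('utf-8') roughly like ua.encode('ascii').decode('utf-8'). The sketch below spells out that hidden step in Python 3. The two helper functions are hypothetical names for this illustration and an emulation only, not Python 2's actual implementation.

```python
a = b"\xc3\xa0"   # what Python 2 stored for the str 'à' (UTF-8 bytes)
ua = "\u00e0"     # what Python 2 stored for the unicode u'à'

def py2_style_str_encode(data, encoding):
    # Hypothetical helper: implicit ASCII decode first, then the requested encode.
    return data.decode("ascii").encode(encoding)

def py2_style_unicode_decode(text, encoding):
    # Hypothetical helper: implicit ASCII encode first, then the requested decode.
    return text.encode("ascii").decode(encoding)

errors = []
for func, value in ((py2_style_str_encode, a), (py2_style_unicode_decode, ua)):
    try:
        func(value, "utf-8")
    except UnicodeError as exc:
        errors.append(type(exc).__name__)

print(errors)  # ['UnicodeDecodeError', 'UnicodeEncodeError']
```

In both cases the failure comes from the implicit ASCII step, since à is not an ASCII character. The requested UTF-8 step is never reached.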
If you are stuck supporting Python 2, just follow these rules:
- unicode is for text.
- str is for binary data.
- unicode.encode produces a str value.
- str.decode produces a unicode value.
- If you find yourself calling str.encode, you are using the wrong type.
- If you find yourself calling unicode.decode, you are using the wrong type.
Upvotes: 3