Reputation: 21771
Python 3.2 (r32:88445, Feb 20 2011, 21:29:02) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> str_version = 'នយោបាយ'
>>> type(str_version)
<class 'str'>
>>> print (str_version)
នយោបាយ
>>> unicode_version = 'នយោបាយ'.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    unicode_version = 'នយោបាយ'.decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'
>>>
What is the problem with my Unicode string?
Upvotes: 8
Views: 15592
You already have a Unicode string. In Python 3, str objects are Unicode strings (unicode in Python 2.x), and byte strings (Python 2.x str) aren't treated as text anymore; they're now called bytes. The latter can be converted into a str with its decode method, but the former is already decoded: you can only encode it back into bytes.
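A minimal sketch of that round trip, assuming Python 3 and reusing the string from the question (the variable names are only illustrative):

```python
# In Python 3 a str literal is already Unicode text.
text = 'នយោបាយ'

# str -> bytes: encoding is the only direction a str supports.
data = text.encode('utf-8')

# bytes -> str: decoding is the only direction bytes support.
round_trip = data.decode('utf-8')

print(type(text).__name__, type(data).__name__)  # str bytes
print(len(text), len(data))                      # 6 18
print(round_trip == text)                        # True
```

The length difference shows that the two types count different things: str counts code points, bytes counts encoded bytes.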
Upvotes: 3
Reputation: 89547
There is nothing wrong with your string! You have just confused encode() and decode(). A string is a sequence of meaningful symbols. To turn it into bytes that can be stored in a file or transmitted over the Internet, use encode() with an encoding like UTF-8. Each encoding is a scheme for converting meaningful symbols into flat bytes of output.
When the time comes to do the opposite (to take some raw bytes from a file or a socket and turn them into symbols like letters and numbers), you decode the bytes using the decode() method of bytestrings in Python 3.
>>> str_version = 'នយោបាយ'
>>> str_version.encode('utf-8')
b'\xe1\x9e\x93\xe1\x9e\x99\xe1\x9f\x84\xe1\x9e\x94\xe1\x9e\xb6\xe1\x9e\x99'
See that big long line of bytes? Those are the bytes that UTF-8 uses to represent your string when you need to transmit it over a network or store it in a document. There are many other encodings in use, but UTF-8 seems to be the most popular. Each encoding can turn meaningful symbols like ន and យោ into bytes, the little 8-bit numbers with which computers communicate.
>>> rawbytes = str_version.encode('utf-8')
>>> rawbytes
b'\xe1\x9e\x93\xe1\x9e\x99\xe1\x9f\x84\xe1\x9e\x94\xe1\x9e\xb6\xe1\x9e\x99'
>>> rawbytes.decode('utf-8')
'នយោបាយ'
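To illustrate the point that each encoding is its own scheme, here is a small sketch, assuming Python 3, that encodes the same string with two standard codecs and then tries one that cannot represent it. The choice of UTF-16 (little-endian) and ASCII for comparison is mine, not something from the question:

```python
text = 'នយោបាយ'  # the string from the question: 6 code points

# The same text produces different bytes under different encodings.
utf8_bytes = text.encode('utf-8')       # 3 bytes per character for this text
utf16_bytes = text.encode('utf-16-le')  # 2 bytes per character for this text
print(len(text), len(utf8_bytes), len(utf16_bytes))  # 6 18 12

# Not every encoding can represent every symbol: ASCII has no Khmer letters.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False

# Bytes only turn back into the original text when decoded with
# the same encoding that produced them.
print(utf16_bytes.decode('utf-16-le') == text)  # True
```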
Upvotes: 10
Reputation: 799250
You're reading the 2.x docs. str.decode() (and bytes.encode()) were dropped in 3.x. And str is already a Unicode string; there's no need to decode it.
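A quick way to check this for yourself, assuming a Python 3 interpreter, is to ask each type which of the two methods it has:

```python
# Python 3: each type keeps only the conversion that makes sense for it.
str_has_decode = hasattr(str, 'decode')      # text is already decoded
str_has_encode = hasattr(str, 'encode')
bytes_has_encode = hasattr(bytes, 'encode')  # bytes are already encoded
bytes_has_decode = hasattr(bytes, 'decode')

print(str_has_decode, str_has_encode)      # False True
print(bytes_has_encode, bytes_has_decode)  # False True
```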
Upvotes: 7