tree em

Reputation: 21771

String in Python with my Unicode?

Python 3.2 (r32:88445, Feb 20 2011, 21:29:02) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> str_version = 'នយោបាយ'
>>> type(str_version)
<class 'str'>
>>> print (str_version)
នយោបាយ
>>> unicode_version = 'នយោបាយ'.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    unicode_version = 'នយោបាយ'.decode('utf-8')
AttributeError: 'str' object has no attribute 'decode'
>>> 

What is the problem with my Unicode string?

Upvotes: 8

Views: 15592

Answers (3)

user395760

You already have a Unicode string. In Python 3, str is the Unicode string type (called unicode in Python 2.x). Byte strings (str in Python 2.x) are no longer treated as text; they are now called bytes. A bytes object can be converted into a str with its decode method, but a str is already decoded: you can only encode it back into bytes.
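As a minimal sketch of that distinction, assuming Python 3 and a UTF-8 encoding (the variable names are only illustrative), the round trip between the two types looks like this:

```python
text = 'នយោបាយ'                # str: already Unicode text in Python 3
data = text.encode('utf-8')     # bytes: the encoded form of that text

print(type(text).__name__)      # str
print(type(data).__name__)      # bytes
print(len(text), len(data))     # 6 code points, 18 bytes

# Only bytes can be decoded; decoding returns the original str.
roundtrip = data.decode('utf-8')
print(roundtrip == text)
```

The length difference shows that the two objects are not interchangeable: one counts characters, the other counts bytes.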

Upvotes: 3

Brandon Rhodes

Reputation: 89547

There is nothing wrong with your string! You have just confused encode() and decode(). A string is a sequence of meaningful symbols. To turn it into bytes that can be stored in a file or transmitted over the Internet, use encode() with an encoding such as UTF-8. Each encoding is a scheme for converting meaningful symbols into flat bytes of output.

When the time comes to do the opposite, taking raw bytes from a file or a socket and turning them into symbols like letters and numbers, you decode the bytes using the decode() method of bytes objects in Python 3.

>>> str_version = 'នយោបាយ'
>>> str_version.encode('utf-8')
b'\xe1\x9e\x93\xe1\x9e\x99\xe1\x9f\x84\xe1\x9e\x94\xe1\x9e\xb6\xe1\x9e\x99'

See that long line of bytes? Those are the bytes that UTF-8 uses to represent your string when you need to transmit it over a network or store it in a document. Many other encodings are in use, but UTF-8 seems to be the most popular. Each encoding can turn meaningful symbols like ន and យោ into bytes, the little 8-bit numbers with which computers communicate.

>>> rawbytes = str_version.encode('utf-8')
>>> rawbytes
b'\xe1\x9e\x93\xe1\x9e\x99\xe1\x9f\x84\xe1\x9e\x94\xe1\x9e\xb6\xe1\x9e\x99'
>>> rawbytes.decode('utf-8')
'នយោបាយ'
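To illustrate the point about other encodings, here is a rough sketch that encodes the same string three ways and then tries to decode the UTF-8 bytes with the wrong codec. The choice of codecs and the variable names are assumptions made for the example, not part of the original session:

```python
text = 'នយោបាយ'

# The same symbols produce different bytes under different encodings.
encoded = {name: text.encode(name) for name in ('utf-8', 'utf-16-le', 'utf-32-le')}
for name, data in encoded.items():
    # Decoding with the matching codec always gives back the original text.
    print(name, len(data), data.decode(name) == text)

# Decoding with a codec that does not match the bytes fails.
wrong_codec_error = None
try:
    encoded['utf-8'].decode('ascii')
except UnicodeDecodeError as exc:
    wrong_codec_error = exc
    print('ascii cannot decode these bytes:', exc.reason)
```

Bytes carry no record of the encoding that produced them, so the codec passed to decode() has to match the one that was used for encode().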

Upvotes: 10

Ignacio Vazquez-Abrams

Reputation: 799250

You're reading the 2.x docs. str.decode() and bytes.encode() were dropped in 3.x. A str is already a Unicode string; there's no need to decode it.
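A quick way to confirm this on Python 3 is to check which methods each type has. The sketch below also includes to_text, a hypothetical helper (not a standard library function) for code that may receive either type; UTF-8 as the default encoding is an assumption:

```python
text = 'នយោបាយ'
data = text.encode('utf-8')

# On Python 3, str has encode() only and bytes has decode() only.
str_has_decode = hasattr(text, 'decode')      # False
bytes_has_encode = hasattr(data, 'encode')    # False
print(str_has_decode, bytes_has_encode)


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return str unchanged, decode bytes to str."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(to_text(data) == text)   # bytes are decoded
print(to_text(text) is text)   # str is passed through untouched
```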

Upvotes: 7
