Eduard Florinescu

Reputation: 17511

What type of representation does Python use by default to store Unicode strings?

If I do this in Python:

>>> name = "âțâîâ"
>>> name
'\xc3\xa2\xc8\x9b\xc3\xa2\xc3\xae\xc3\xa2'
>>> len(name)
10
>>> u = name.decode('utf-8')
>>> len (u)
5
>>>

What is the default encoding in Python if you don't specify one?

Upvotes: 1

Views: 448

Answers (2)

glglgl

Reputation: 91017

You are probably using Python 2. (If not, this answer does not apply.)

What happens is the following:

>>> name = "âțâîâ"

You assign to name a (byte) string whose contents are determined by the encoding of your terminal or, for a source file, of your text editor. In your case, this is evidently UTF-8.

These bytes are shown with

>>> name
'\xc3\xa2\xc8\x9b\xc3\xa2\xc3\xae\xc3\xa2'

Only if you decode it with

>>> u = name.decode('utf-8')

you get a unicode string. Here you name the encoding explicitly.

A simpler and more reliable way would be to write the unicode literal directly:

u = u"âțâîâ"

and only then encode it to bytes in whichever encoding you need:

name = u.encode("utf-8")
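For reference, here is a minimal sketch of the same round trip that does not depend on the terminal or editor encoding at all. It assumes Python 3.3+ (where the u prefix is accepted again and text strings are Unicode by default) and spells the characters as \u escapes so the source stays pure ASCII; the variable names text and data are only illustrative.

```python
# Sketch only: the characters are written as escapes, so the encoding of
# the terminal or editor cannot influence the result.
text = u"\u00e2\u021b\u00e2\u00ee\u00e2"   # the five characters from the question

# Encoding turns the text into bytes; each of these characters takes
# two bytes in UTF-8.
data = text.encode("utf-8")

print(len(text))   # number of characters
print(len(data))   # number of bytes

# Decoding with the same encoding gives the original text back.
assert data.decode("utf-8") == text
```

The character count and the byte count differ for the same reason as in the question: len() counts code points for a text string and bytes for a byte string.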

Upvotes: 1

Martijn Pieters

Reputation: 1121226

You are specifying a Python string literal, and its encoding is determined by the default settings of your editor (or, in the case of the interactive interpreter, of your terminal). Python has no say in this.

By default, Python 2 tries to interpret source code as ASCII. In Python 3 this has been switched to UTF-8.
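As a rough illustration of how the source encoding decides what a literal means, the following Python 3 sketch compiles the same two bytes twice: once with no declaration (so the UTF-8 default applies) and once with a PEP 263 style coding line declaring Latin-1. The names src_default, src_latin1, ns_default and ns_latin1 are made up for this example.

```python
# Python 3 sketch: identical bytes inside the quotes, different source encodings.
src_default = b'x = "\xc3\xa2"\n'                              # no declaration: UTF-8
src_latin1 = b'# -*- coding: latin-1 -*-\nx = "\xc3\xa2"\n'    # declared as Latin-1

ns_default, ns_latin1 = {}, {}
exec(compile(src_default, "<default>", "exec"), ns_default)
exec(compile(src_latin1, "<latin1>", "exec"), ns_latin1)

# Read as UTF-8 the two bytes form one character; read as Latin-1 they
# form two separate characters.
print(len(ns_default["x"]))
print(len(ns_latin1["x"]))
```

In Python 2 the same kind of declaration at the top of a file is what lets you use non-ASCII literals in source code at all.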

Please read the Python Unicode HOWTO to further understand the difference between Unicode and input and output encodings. You really should also read Joel Spolsky's article on Unicode.

Upvotes: 2
