Reputation: 17511
If I do this in python:
>>> name = "âțâîâ"
>>> name
'\xc3\xa2\xc8\x9b\xc3\xa2\xc3\xae\xc3\xa2'
>>> len(name)
10
>>> u = name.decode('utf-8')
>>> len (u)
5
>>>
What is the default encoding in Python if you don't specify one?
Upvotes: 1
Views: 448
Reputation: 91017
You are probably using Python 2. (If not, this answer does not apply.)
What happens is the following:
>>> name = "âțâîâ"
You assign to name a byte string whose contents are determined by the encoding of your terminal or, in a script, of your text editor. In your case, this is evidently UTF-8.
These bytes are shown with
>>> name
'\xc3\xa2\xc8\x9b\xc3\xa2\xc3\xae\xc3\xa2'
Only if you decode it with
>>> u = name.decode('utf-8')
do you get a unicode string. Here you specify the encoding explicitly.
A simpler and more reliable way would be to directly do
u = u"âțâîâ"
and only then produce the bytes in whichever encoding you want:
name = u.encode("utf-8")
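To make the byte/text distinction concrete, here is a minimal sketch of the same round trip. It assumes Python 3 (where `str` is Unicode text and `bytes` is the byte string), and the variable names are only illustrative. The text is written with escapes so the result does not depend on the terminal or editor encoding.

```python
# Text written with escapes so the source file stays pure ASCII
# and the result does not depend on the terminal or editor.
text = u"\xe2\u021b\xe2\xee\xe2"          # the five characters from the question

utf8_bytes = text.encode("utf-8")         # explicit encoding: text -> bytes
round_trip = utf8_bytes.decode("utf-8")   # explicit decoding: bytes -> text

# Decoding the same bytes with the wrong codec does not raise for Latin-1;
# it silently yields one character per byte instead.
mojibake = utf8_bytes.decode("latin-1")

print(len(text), len(utf8_bytes), len(mojibake))
print(round_trip == text)
```

The character count and the byte count differ because each of these characters takes two bytes in UTF-8, which is the 5 versus 10 seen in the question.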
Upvotes: 1
Reputation: 1121226
You are specifying a Python string literal, and the bytes it contains are determined by the default settings of your editor (or, in the case of the interactive interpreter, of your terminal). Python has no say in this.
By default, Python 2 tries to interpret source code as ASCII. In Python 3 the default has been switched to UTF-8.
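As a rough way to inspect these settings on your own machine, the sketch below declares the source encoding explicitly and prints two values Python reports. Note that `sys.getdefaultencoding()` is the codec used for implicit conversions, not the source-file encoding itself, although both default to ASCII on Python 2 and UTF-8 on Python 3. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
# The line above is a source encoding declaration. It is needed on
# Python 2 for non-ASCII literals; on Python 3 UTF-8 is already the default.
import locale
import sys

# Codec used for implicit str/bytes conversions:
# 'ascii' on Python 2, 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()

# Encoding the operating system locale prefers, which commonly
# matches what the terminal uses.
preferred_encoding = locale.getpreferredencoding(False)

print(default_encoding)
print(preferred_encoding)
```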
Please read the Python Unicode HOWTO to further understand the difference between Unicode and input and output encodings. You really should also read Joel Spolsky's article on Unicode.
Upvotes: 2