imsrch

Reputation: 1162

Encoding in Python 2.7

I have some questions about encoding in Python 2.7.

1. The Python code is as follows:

#s = u"严"
s = u'\u4e25'
print 's is:', s
print 'len of s is:', len(s)
s1 = "a" + s
print 's1 is:', s1
print 'len of s1 is:', len(s1)

the output is:

s is: 严
len of s is: 1
s1 is: a严
len of s1 is: 2

I am confused about why the length of s is 1. How could 4e25 be stored in one byte? I also notice that UCS-2 is 2 bytes long and UCS-4 is 4 bytes long, so why is the Unicode string s's length 1?

2. (1) Create a new file named a.py with Notepad++ (Windows 7) and set the file's encoding to ANSI; the code in a.py is as follows:

# -*- encoding:utf-8 -*-
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

ascii
s: 严
type of s: <type 'str'>

(2) Create a new file named b.py with Notepad++ (Windows 7) and set the file's encoding to UTF-8; the code in b.py is as follows:

# -*- encoding:gbk -*-
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

  File "D:\pyws\code\\b.py", line 1
SyntaxError: encoding problem: utf-8

(3) Change file b.py as below (the file's encoding is still UTF-8):

import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

ascii
s: 涓
type of s: <type 'str'>

(4) Change file a.py as below (the file's encoding is still ANSI):

import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

  File "D:\pyws\code\a1.py", line 3
SyntaxError: Non-ASCII character '\xd1' in file D:\pyws\code\a1.py on
line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html f
or details

Why are the outputs of these four cases in question 2 all different? Can anybody explain it in detail?

Upvotes: 1

Views: 6755

Answers (2)

Mark Tolonen

Reputation: 177674

Answer to Question 1:

In Python versions before 3.3, the length of a Unicode string u'' is the number of UTF-16 or UTF-32 code units used (depending on build flags), not the number of bytes. \u4e25 is a single code unit, but not all characters are represented by one code unit on a UTF-16 (narrow) build, which is the default on Windows.

>>> len(u'\u4e25')
1
>>> len(u'\U00010123')
2

In Python 3.3 and later, len returns 1 for both of the strings above.
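On versions before 3.3, you can check which kind of build you are running with sys.maxunicode (the value shown here assumes a narrow build):

>>> import sys
>>> sys.maxunicode  # 65535 on a narrow (UTF-16) build, 1114111 on a wide (UTF-32) build
65535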

Also, a single displayed character can be composed of multiple code points using combining marks, such as é. The unicodedata.normalize function can be used to generate the composed or decomposed form:

>>> import unicodedata as ud
>>> ud.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.normalize('NFD',u'\xe9')
u'e\u0301'
>>> ud.normalize('NFC',u'e\u0301')
u'\xe9'

So even in Python 3.3, a single displayed character can consist of one or more code points, and it is best to normalize to one form or the other to get consistent answers.
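For example, continuing the session above, len depends on the normalization form:

>>> len(ud.normalize('NFC', u'e\u0301'))
1
>>> len(ud.normalize('NFD', u'\xe9'))
2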

Answer to Question 2:

The encoding declared at the top of the file must agree with the encoding in which the file is saved. The declaration lets Python know how to interpret the bytes in the file.

For example, the character 严 is stored as three bytes in a file saved as UTF-8, but as two bytes in a file saved as GBK:

>>> u'严'.encode('utf8')
'\xe4\xb8\xa5'
>>> u'严'.encode('gbk')
'\xd1\xcf'

If you declare the wrong encoding, the bytes are interpreted incorrectly and Python either displays the wrong characters or throws an exception.
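For example, decoding the GBK bytes of 严 with a mismatched codec silently produces the wrong characters instead of an error:

>>> '\xd1\xcf'.decode('gbk')
u'\u4e25'
>>> '\xd1\xcf'.decode('latin-1')  # wrong codec: no error, just mojibake
u'\xd1\xcf'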

Edit per comment

2(1) - This is system dependent, because ANSI means the system locale's default encoding. On my system that is cp1252, and Notepad++ can't display a Chinese character. If I set my system locale to Chinese (PRC), I get your results on a console terminal. The reason it works correctly in that case is that a byte string is used and its bytes are sent straight to the terminal. Since the file was encoded as ANSI on a Chinese (PRC) locale, the bytes in the byte string are correctly interpreted by the Chinese (PRC) locale terminal.
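A minimal reconstruction of what the Chinese (PRC) terminal does with those bytes:

>>> print '\xd1\xcf'.decode('gbk')  # the ANSI (GBK) bytes of the character
严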

2(2) - The file is encoded in UTF-8 but the encoding is declared as GBK. When Python reads the encoding declaration it tries to interpret the file as GBK and fails: the UTF-8 encoding chosen in Notepad++ also writes a UTF-8-encoded byte order mark (BOM) as the first bytes of the file, and the GBK codec can't read them as a valid GBK-encoded character, so it fails on line 1.
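For illustration, the UTF-8 BOM bytes are not a valid GBK sequence:

>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
>>> codecs.BOM_UTF8.decode('gbk')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: incomplete multibyte sequence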

2(3) - The file is encoded in UTF-8 (with BOM) but has no encoding declaration. Python recognizes the UTF-8-encoded BOM and uses UTF-8 as the source encoding, but the terminal is GBK. Since a byte string was used, the UTF-8-encoded bytes are sent to the GBK terminal and you get:

>>> u'严'.encode('utf8')
'\xe4\xb8\xa5'
>>> '\xe4\xb8'.decode('gbk')
u'\u6d93'
>>> print '\xe4\xb8'.decode('gbk')
涓

In this case I am surprised, because the terminal is ignoring the trailing byte \xa5; as you can see below, when I explicitly decode incorrectly, Python throws an exception:

>>> u'严'.encode('utf8').decode('gbk')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa5 in position 2: incomplete multibyte sequence

2(4) - In this case, the encoding is ANSI (GBK) but no encoding is declared, and there is no BOM (as in UTF-8) to give Python a hint, so it assumes ASCII and can't handle the GBK-encoded character on line 3.
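You can reproduce the failure directly: the first GBK byte of the character, \xd1, is outside the ASCII range, which is exactly what the SyntaxError reports:

>>> '\xd1\xcf'.decode('ascii')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)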

Upvotes: 1

BrenBarn

Reputation: 251383

I am confused about why the length of s is 1. How could 4e25 be stored in one byte? I also notice that UCS-2 is 2 bytes long and UCS-4 is 4 bytes long, so why is the Unicode string s's length 1?

The whole point of unicode strings is to do this. The length of a unicode string is the number of characters (i.e., code points), not the number of bytes. The number of bytes may vary depending on the encoding, but the number of characters is an abstract invariant that doesn't change with encoding.
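For example, the same one-character string takes a different number of bytes in each encoding (a quick check in Python 2.7):

>>> s = u'\u4e25'
>>> len(s)
1
>>> len(s.encode('utf-8')), len(s.encode('utf-16-le')), len(s.encode('gbk'))
(3, 2, 2)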

As for your second question, the answer is that in setting a file's encoding, you are telling Python how to map bytes in that file to characters. If you specify an encoding (with the # encoding syntax) that is inconsistent with the file's actual encoding, you will get unpredictable behavior, because Python is trying to interpret the bytes one way, but the file is set up so the bytes actually mean something else.

The kind of behavior you get will depend on the specifics of the encodings you use. Some possibilities are:

  1. You'll get lucky and it will work even though you use conflicting encodings; this is what happened in your first case.
  2. It will raise an error because the bytes in the file aren't consistent with the specified encoding; this is what happened in your second case.
  3. It will seem to work, but produce different characters, because the bytes in the file's actual encoding mean something else when interpreted with the specified encoding. This seems to be what happened in your third case, although it ought to raise an error since that character isn't ASCII. (By "the file's encoding style is UTF-8" did you mean you set an # encoding directive to that effect in the file?)
  4. If you don't specify any encoding, you'll get an error if you try to use any bytes that aren't in plain ASCII. This is what happened in your last case.

Also, the type of the string is str in all cases, because you didn't specify the string as being unicode (e.g., with u"..."). Specifying a file encoding doesn't make strings unicode. It just tells Python how to interpret the characters in the file.
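A minimal sketch of the difference, assuming a file saved as UTF-8 with a matching declaration:

# -*- coding: utf-8 -*-
s = "严"                # str: the raw UTF-8 bytes read from the file
u = u"严"               # unicode: those bytes decoded via the declared encoding
print type(s), len(s)   # <type 'str'> 3
print type(u), len(u)   # <type 'unicode'> 1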

However, there's a bigger question here, which is: why are you playing those games with encodings in your examples? There is no reason whatsoever to use an # encoding marker to specify an encoding other than the one the file is actually encoded in, and doing so is guaranteed to cause problems. Don't do it. You have to know what encoding the file is in, and specify that same encoding in the # encoding marker.

Upvotes: 5
