Reputation: 1162
I have some questions about encoding in Python 2.7.
1. The Python code is as below:
#s = u"严"
s = u'\u4e25'
print 's is:', s
print 'len of s is:', len(s)
s1 = "a" + s
print 's1 is:', s1
print 'len of s1 is:', len(s1)
the output is:
s is: 严
len of s is: 1
s1 is: a严
len of s1 is: 2
I am confused: why is the length of s 1? How could 4e25 be stored in one byte? I also notice that UCS-2 is 2 bytes long and UCS-4 is 4 bytes long, so why is the length of the Unicode string s 1?
2.
(1) Create a file named a.py with Notepad++ (Windows 7) and set the file's encoding to ANSI; the code in a.py is as below:
# -*- encoding:utf-8 -*-
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)
the output is:
ascii
s: 严
type of s: <type 'str'>
(2) Create a file named b.py with Notepad++ (Windows 7) and set the file's encoding to UTF-8; the code in b.py is as below:
# -*- encoding:gbk -*-
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)
the output is:
File "D:\pyws\code\\b.py", line 1
SyntaxError: encoding problem: utf-8
(3) Change file b.py as below (the file's encoding is still UTF-8):
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)
the output is:
ascii
s: 涓
type of s: <type 'str'>
(4) Change file a.py as below (the file's encoding is still ANSI):
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)
the output is:
File "D:\pyws\code\a1.py", line 3
SyntaxError: Non-ASCII character '\xd1' in file D:\pyws\code\a1.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Why are the outputs of these 4 cases in question 2 different? Can anybody explain it in detail?
Upvotes: 1
Views: 6755
Reputation: 177674
In Python versions < 3.3, the length of a Unicode string u'' is the number of UTF-16 or UTF-32 code units used (depending on build flags), not the number of bytes. \u4e25 is one code unit, but not all characters are represented by one code unit if UTF-16 (the default on Windows) is used.
>>> len(u'\u4e25')
1
>>> len(u'\U00010123')
2
In Python 3.3 and later, len() returns 1 for both of the strings above.
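If you want to check which kind of build you have, sys.maxunicode shows it (the value below assumes a narrow UTF-16 build, the Windows default; a wide UTF-32 build reports 1114111):
>>> import sys
>>> sys.maxunicode
65535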
Unicode characters can also be composed of combining code units, such as é. The unicodedata.normalize function can be used to generate the composed or decomposed form:
>>> import unicodedata as ud
>>> ud.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.normalize('NFD',u'\xe9')
u'e\u0301'
>>> ud.normalize('NFC',u'e\u0301')
u'\xe9'
So even in Python 3.3, a single display character can have 1 or more code units, and it is best to normalize to one form or another for consistent answers.
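For example, continuing the session above, the composed and decomposed forms of é even have different lengths, which is why normalizing first gives consistent answers:
>>> len(u'e\u0301')            # decomposed: 2 code units
2
>>> len(ud.normalize('NFC', u'e\u0301'))  # composed: 1 code unit
1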
The encoding declared at the top of the file must agree with the encoding in which the file is saved. The declaration lets Python know how to interpret the bytes in the file.
For example, the character 严 is saved as 3 bytes in a file saved as UTF-8, but as 2 bytes in a file saved as GBK:
>>> u'严'.encode('utf8')
'\xe4\xb8\xa5'
>>> u'严'.encode('gbk')
'\xd1\xcf'
If you declare the wrong encoding, the bytes are interpreted incorrectly and Python either displays the wrong characters or throws an exception.
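To see that concretely, here are the same two GBK bytes for 严 decoded first with the right codec and then with a deliberately wrong one (latin-1 is just an arbitrary wrong choice for illustration):
>>> '\xd1\xcf'.decode('gbk')       # the right codec
u'\u4e25'
>>> print '\xd1\xcf'.decode('latin-1')  # same bytes, wrong codec, wrong characters
ÑÏ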
Edit per comment
2(1) - This is system dependent, because ANSI means the system locale's default encoding. On my system that is cp1252, and Notepad++ can't display a Chinese character. If I set my system locale to Chinese(PRC), then I get your results on a console terminal. The reason it works correctly in that case is that a byte string is used and its bytes are just sent to the terminal. Since the file was encoded in ANSI on a Chinese(PRC) locale, the bytes the byte string contains are correctly interpreted by the Chinese(PRC) locale terminal.
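If you want to see what your own system uses, the console and locale encodings can be inspected like this (the values shown assume a Chinese(PRC) locale, where the ANSI code page is cp936, a GBK variant):
>>> import sys, locale
>>> sys.stdout.encoding            # encoding of the attached console
'cp936'
>>> locale.getpreferredencoding()  # the locale default that "ANSI" refers to
'cp936'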
2(2) - The file is encoded in UTF-8 but the encoding is declared as GBK. On Notepad++, choosing UTF-8 also writes a UTF-8-encoded byte order mark (BOM) as the first bytes of the file. When Python reads the declared encoding it tries to interpret the file as GBK, but the GBK codec doesn't read the BOM as a valid GBK-encoded character, so it fails on line 1.
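You can inspect the BOM bytes in question with the codecs module; they are not a complete, valid GBK sequence, which is why GBK can't digest them (the traceback below is what I'd expect from an interactive session):
>>> import codecs
>>> codecs.BOM_UTF8                # the three bytes Notepad++ writes at the start of a UTF-8 file
'\xef\xbb\xbf'
>>> codecs.BOM_UTF8.decode('gbk')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: incomplete multibyte sequence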
2(3) - The file is encoded in UTF-8 (with BOM) but has no encoding declaration. Python recognizes the UTF-8-encoded BOM and uses UTF-8 as the source encoding, but the terminal is GBK. Since a byte string was used, the UTF-8-encoded bytes are sent to the GBK terminal and you get:
>>> u'严'.encode('utf8')
'\xe4\xb8\xa5'
>>> '\xe4\xb8'.decode('gbk')
u'\u6d93'
>>> print '\xe4\xb8'.decode('gbk')
涓
In this case I am surprised, because Python is ignoring the byte \xa5; as you can see below, when I explicitly decode incorrectly, Python throws an exception:
>>> u'严'.encode('utf8').decode('gbk')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa5 in position 2: incomplete multibyte sequence
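The console's behavior looks like decoding with a lenient error handler; you can reproduce the effect explicitly (this is just an illustration, not necessarily what the console does internally):
>>> '\xe4\xb8\xa5'.decode('gbk', 'ignore')   # drop undecodable bytes
u'\u6d93'
>>> '\xe4\xb8\xa5'.decode('gbk', 'replace')  # or substitute U+FFFD
u'\u6d93\ufffd'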
2(4) - In this case, the encoding is ANSI (GBK) but no encoding is declared, and there is no BOM as in UTF-8 to give Python a hint, so it assumes ASCII and can't handle the GBK-encoded character on line 3.
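Per PEP 263, the fix for 2(4) is to declare the encoding the file is actually saved in; assuming the ANSI code page is GBK (as on a Chinese(PRC) system), a.py would start like this:
# -*- coding: gbk -*-
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)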
Upvotes: 1
Reputation: 251383
I am confused: why is the length of s 1? How could 4e25 be stored in one byte? I also notice that UCS-2 is 2 bytes long and UCS-4 is 4 bytes long, so why is the length of the Unicode string s 1?
The whole point of unicode strings is to do this. The length of a unicode string is the number of characters (i.e., code points), not the number of bytes. The number of bytes may vary depending on the encoding, but the number of characters is an abstract invariant that doesn't change with encoding.
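You can see the invariant directly: the character count stays the same while the byte count changes with the encoding:
>>> s = u'\u4e25'
>>> len(s)                      # one character, regardless of encoding
1
>>> len(s.encode('utf-8'))      # three bytes in UTF-8
3
>>> len(s.encode('gbk'))        # two bytes in GBK
2
>>> len(s.encode('utf-32-be'))  # four bytes in UTF-32
4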
As for your second question, the answer is that in setting a file's encoding, you are telling Python how to map bytes in that file to characters. If you specify an encoding (with the # encoding syntax) that is inconsistent with the file's actual encoding, you will get unpredictable behavior, because Python is trying to interpret the bytes one way, but the file is set up so the bytes actually mean something else.
The kind of behavior you get will depend on the specifics of the encodings you use: the bytes may happen to decode to different (and wrong) characters, or they may not be decodable at all and raise an error (as in your cases 2(2) and 2(4)).
Also, the type of the string is str in all cases, because you didn't specify the strings as being unicode (e.g., with u"..."). Specifying a file encoding doesn't make strings unicode; it just tells Python how to interpret the bytes in the file.
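A small demonstration of that point (the bytes '\xe4\xb8\xa5' are what 严 looks like in a UTF-8 file):
>>> type('\xe4\xb8\xa5')   # a plain literal stays a byte string
<type 'str'>
>>> type(u'\u4e25')        # only the u prefix produces a unicode object
<type 'unicode'>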
However, there's a bigger question here, which is: why are you playing those games with encodings in your examples? There is no reason whatsoever to use an # encoding marker to specify an encoding other than the one the file is actually encoded in, and doing so is guaranteed to cause problems. Don't do it. You have to know what encoding the file is in, and specify that same encoding in the # encoding marker.
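As a sketch of the recommended setup (assuming the file really is saved as UTF-8 in Notepad++), declare the matching encoding and use a unicode literal:
# -*- coding: utf-8 -*-
import sys
print sys.getdefaultencoding()
s = u"严"                     # a unicode string, so its meaning no longer depends on the console
print "s:", s                 # Python encodes it to the console's encoding when printing
print "type of s:", type(s)   # <type 'unicode'>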
Upvotes: 5