imsrch

Reputation: 1162

Encoding in Python 2.7

I have some questions about encoding in Python 2.7.

1. The Python code is as follows:

#s = u"严"
s = u'\u4e25'
print 's is:', s
print 'len of s is:', len(s)
s1 = "a" + s
print 's1 is:', s1
print 'len of s1 is:', len(s1)

the output is:

s is: 严
len of s is: 1
s1 is: a严
len of s1 is: 2

I am confused about why the length of s is 1. How could 4e25 be stored in one byte? I also notice that UCS-2 is 2 bytes long and UCS-4 is 4 bytes long, so why is the Unicode string s's length 1?

2. (1) Create a new file named a.py with Notepad++ (Windows 7) and set the file's encoding to ANSI; the code in a.py is as follows:

# -*- encoding:utf-8 -*-
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

ascii
s: 严
type of s: <type 'str'>

(2) Create a new file named b.py with Notepad++ (Windows 7) and set the file's encoding to UTF-8; the code in b.py is as follows:

# -*- encoding:gbk -*-
import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

  File "D:\pyws\code\\b.py", line 1
SyntaxError: encoding problem: utf-8

(3) Change file b.py as below (the file's encoding is still UTF-8):

import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

ascii
s: 涓
type of s: <type 'str'>

(4) Change file a.py as below (the file's encoding is still ANSI):

import sys
print sys.getdefaultencoding()
s = "严"
print "s:", s
print "type of s:", type(s)

the output is:

  File "D:\pyws\code\a1.py", line 3
SyntaxError: Non-ASCII character '\xd1' in file D:\pyws\code\a1.py on
line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html f
or details

Why are the outputs of these four cases in question 2 all different? Can anybody explain it in detail?

Upvotes: 1

Views: 6755

Answers (2)

Mark Tolonen

Reputation: 177674

Answer to Question 1:

In Python versions before 3.3, the length of a Unicode string u'' is the number of UTF-16 or UTF-32 code units used (depending on build flags), not the number of bytes. \u4e25 is a single code unit, but not all characters are represented by one code unit on a UTF-16 (narrow) build, which is the default on Windows.

>>> len(u'\u4e25')
1
>>> len(u'\U00010123')
2

In Python 3.3 and later, len returns 1 for both of the strings above.
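On versions before 3.3, you can check which kind of build you are running with sys.maxunicode (the value shown here assumes a narrow build):

>>> import sys
>>> sys.maxunicode  # 65535 on a narrow (UTF-16) build, 1114111 on a wide (UTF-32) build
65535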

Also, a single displayed character can be composed of multiple code points using combining marks, such as é. The unicodedata.normalize function can be used to generate the composed or decomposed form:

>>> import unicodedata as ud
>>> ud.name(u'\xe9')
'LATIN SMALL LETTER E WITH ACUTE'
>>> ud.normalize('NFD',u'\xe9')
u'e\u0301'
>>> ud.normalize('NFC',u'e\u0301')
u'\xe9'

So even in Python 3.3, a single displayed character can consist of one or more code points, and it is best to normalize to one form or the other to get consistent answers.
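For example, continuing the session above, len depends on the normalization form:

>>> len(ud.normalize('NFC', u'e\u0301'))
1
>>> len(ud.normalize('NFD', u'\xe9'))
2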

Answer to Question 2:

The encoding declared at the top of the file must agree with the encoding in which the file is saved. The declaration lets Python know how to interpret the bytes in the file.

For example, the character 严 is stored as three bytes in a file saved as UTF-8, but as two bytes in a file saved as GBK:

>>> u'严'.encode('utf8')
'\xe4\xb8\xa5'
>>> u'严'.encode('gbk')
'\xd1\xcf'

If you declare the wrong encoding, the bytes are interpreted incorrectly and Python either displays the wrong characters or throws an exception.
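For example, decoding the GBK bytes of 严 with a mismatched codec silently produces the wrong characters instead of an error:

>>> '\xd1\xcf'.decode('gbk')
u'\u4e25'
>>> '\xd1\xcf'.decode('latin-1')  # wrong codec: no error, just mojibake
u'\xd1\xcf'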

Edit per comment

2(1) - This is system dependent, because ANSI means the system locale's default encoding. On my system that is cp1252, and Notepad++ can't display a Chinese character. If I set my system locale to Chinese (PRC), I get your results on a console terminal. The reason it works correctly in that case is that a byte string is used and its bytes are sent straight to the terminal. Since the file was encoded as ANSI on a Chinese (PRC) locale, the bytes in the byte string are correctly interpreted by the Chinese (PRC) locale terminal.
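A minimal reconstruction of what the Chinese (PRC) terminal does with those bytes:

>>> print '\xd1\xcf'.decode('gbk')  # the ANSI (GBK) bytes of the character
严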

2(2) - The file is encoded in UTF-8 but the encoding is declared as GBK. When Python reads the encoding declaration it tries to interpret the file as GBK and fails: the UTF-8 encoding chosen in Notepad++ also writes a UTF-8-encoded byte order mark (BOM) as the first bytes of the file, and the GBK codec can't read them as a valid GBK-encoded character, so it fails on line 1.
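For illustration, the UTF-8 BOM bytes are not a valid GBK sequence:

>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'
>>> codecs.BOM_UTF8.decode('gbk')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: incomplete multibyte sequence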

2(3) - The file is encoded in UTF-8 (with BOM) but has no encoding declaration. Python recognizes the UTF-8-encoded BOM and uses UTF-8 as the source encoding, but the terminal is GBK. Since a byte string was used, the UTF-8-encoded bytes are sent to the GBK terminal and you get:

>>> u'严'.encode('utf8')
'\xe4\xb8\xa5'
>>> '\xe4\xb8'.decode('gbk')
u'\u6d93'
>>> print '\xe4\xb8'.decode('gbk')
涓

In this case I am surprised, because the terminal is ignoring the trailing byte \xa5; as you can see below, when I explicitly decode incorrectly, Python throws an exception:

>>> u'严'.encode('utf8').decode('gbk')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa5 in position 2: incomplete multibyte sequence

2(4) - In this case, the encoding is ANSI (GBK) but no encoding is declared, and there is no BOM (as in UTF-8) to give Python a hint, so it assumes ASCII and can't handle the GBK-encoded character on line 3.
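You can reproduce the failure directly: the first GBK byte of the character, \xd1, is outside the ASCII range, which is exactly what the SyntaxError reports:

>>> '\xd1\xcf'.decode('ascii')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 0: ordinal not in range(128)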

Upvotes: 1

BrenBarn

Reputation: 251383

I am confused about why the length of s is 1. How could 4e25 be stored in one byte? I also notice that UCS-2 is 2 bytes long and UCS-4 is 4 bytes long, so why is the Unicode string s's length 1?

The whole point of unicode strings is to do this. The length of a unicode string is the number of characters (i.e., code points), not the number of bytes. The number of bytes may vary depending on the encoding, but the number of characters is an abstract invariant that doesn't change with encoding.
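For example, the same one-character string takes a different number of bytes in each encoding (a quick check in Python 2.7):

>>> s = u'\u4e25'
>>> len(s)
1
>>> len(s.encode('utf-8')), len(s.encode('utf-16-le')), len(s.encode('gbk'))
(3, 2, 2)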

As for your second question, the answer is that in setting a file's encoding, you are telling Python how to map bytes in that file to characters. If you specify an encoding (with the # encoding syntax) that is inconsistent with the file's actual encoding, you will get unpredictable behavior, because Python is trying to interpret the bytes one way, but the file is set up so the bytes actually mean something else.

The kind of behavior you get will depend on the specifics of the encodings you use. Some possibilities are:

  1. You'll get lucky and it will work even though you use conflicting encodings; this is what happened in your first case.
  2. It will raise an error because the bytes in the file aren't consistent with the specified encoding; this is what happened in your second case.
  3. It will seem to work, but produce different characters, because the bytes in the file's actual encoding mean something else when interpreted with the specified encoding. This seems to be what happened in your third case, although it ought to raise an error since that character isn't ASCII. (By "the file's encoding style is UTF-8" did you mean you set an # encoding directive to that effect in the file?)
  4. If you don't specify any encoding, you'll get an error if you try to use any bytes that aren't in plain ASCII. This is what happened in your last case.

Also, the type of the string is str in all cases, because you didn't specify the string as being unicode (e.g., with u"..."). Specifying a file encoding doesn't make strings unicode. It just tells Python how to interpret the characters in the file.
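A minimal sketch of the difference, assuming a file saved as UTF-8 with a matching declaration:

# -*- coding: utf-8 -*-
s = "严"                # str: the raw UTF-8 bytes read from the file
u = u"严"               # unicode: those bytes decoded via the declared encoding
print type(s), len(s)   # <type 'str'> 3
print type(u), len(u)   # <type 'unicode'> 1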

However, there's a bigger question here, which is: why are you playing those games with encodings in your examples? There is no reason whatsoever to use an # encoding marker to specify an encoding other than the one the file is actually encoded in, and doing so is guaranteed to cause problems. Don't do it. You have to know what encoding the file is in, and specify that same encoding in the # encoding marker.

Upvotes: 5
