Y.Su

Reputation: 406

Character encoding between Python2 and Python3

I have a bytestring x defined as follows:

x = b'LF                                                           \xa9 2020 by S&P Global Inc.,200523\n'

In IPython (Python 2):

In [10]: x
Out[10]: 'LF                                                           \xa9 2020 by S&P Global Inc.,200523\n'

In [11]: print(x)
LF                                                           � 2020 by S&P Global Inc.,200523

In [12]: x.decode('ISO-8859-1')
Out[12]: u'LF                                                           \xa9 2020 by S&P Global Inc.,200523\n'

In [13]: print(x.decode('ISO-8859-1'))
LF                                                           © 2020 by S&P Global Inc.,200523

Question 1: Why is the output of x different from that of print(x)? The same goes for x.decode('ISO-8859-1') and print(x.decode('ISO-8859-1')).

In IPython (Python 3):

In [3]: x
Out[3]: b'LF                                                           \xa9 2020 by S&P Global Inc.,200523\n'

In [4]: print(x)
b'LF                                                           \xa9 2020 by S&P Global Inc.,200523\n'

In [5]: x.decode('ISO-8859-1')
Out[5]: 'LF                                                           © 2020 by S&P Global Inc.,200523\n'

In [7]: print(x.decode('ISO-8859-1'))
LF                                                           © 2020 by S&P Global Inc.,200523

Question 2: As you can see, in Python3, the outputs of x and print(x) are the same. So are x.decode('ISO-8859-1') and print(x.decode('ISO-8859-1')). In Python2, this is not the case. Why is there this difference between Python2 and Python3?

Question 3: Why is the output of print(x) different in Python 2 and 3, while the output of x is the same?

Question 4: Why is the output of x.decode('ISO-8859-1') different in Python 2 and 3, while the output of print is the same?

Upvotes: 1

Views: 79

Answers (1)

Brad Solomon

Reputation: 40878

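Question 1: Why is the output of x different from that of print(x)?

Typing x by itself into a REPL echoes the object's repr(), so it can be thought of as:

>>> print repr(x)
'LF                                                           \xa9 2020 by S&P Global Inc.,200523\n'

print(x), on the other hand, writes the raw bytes of the Python 2 str to the terminal. The lone byte 0xA9 is not valid UTF-8, so a UTF-8 terminal (which yours appears to be) shows the replacement character � in its place. The same split applies to the decoded value: the REPL shows the repr() of the unicode object, while print encodes the text for the terminal, which then renders ©.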
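
As a rough illustration of the repr()-versus-print split, here is a minimal Python 3 sketch. The session in the question is Python 2, so this is an analogue rather than a reproduction. The shortened sample bytes and the variable names are made up for the example, and it assumes the terminal in the question expects UTF-8.

```python
# Shortened stand-in for the bytestring in the question (hypothetical sample).
x = b'LF \xa9 2020 by S&P Global Inc.,200523\n'
text = x.decode('ISO-8859-1')

# A bare expression at the REPL is echoed via repr(); print() uses str().
echoed = repr(text)    # quoted, with the newline shown as the two characters \n
printed = str(text)    # the text itself, ending in a real newline

# Byte 0xA9 on its own is not valid UTF-8, so a UTF-8 decoder has to
# substitute U+FFFD (the replacement character) for it.
as_utf8 = x.decode('utf-8', errors='replace')

print(echoed == printed)     # False
print('\ufffd' in as_utf8)   # True
```

The second check is presumably what the terminal is doing when it draws � for print(x) in Python 2: it receives a byte it cannot interpret as UTF-8.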

Question 2: As you can see, in Python3, the outputs of x and print(x) are the same. So are x.decode('ISO-8859-1') and print(x.decode('ISO-8859-1')). In Python2, this is not the case. Why is there this difference between Python2 and Python3?

Because x is a bytes object in Python 3, and print() does not attempt to decode a bytestring: it prints str(x), which for bytes is the same as repr(x). That representation shows byte values over 127 as \xNN escape sequences. In Python 2, x is a str, and print writes its bytes to the terminal unchanged.
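
A small Python 3 check of that claim, using a shortened sample value (the variable names are only for illustration):

```python
x = b'LF \xa9 2020\n'

# For bytes, str() falls back to repr(), so print(x) shows the same
# text that the REPL echoes for a bare x.
shown_by_repl = repr(x)
shown_by_print = str(x)

print(shown_by_repl == shown_by_print)  # True
print(shown_by_repl)                    # b'LF \xa9 2020\n'
```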

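Question 3: Why is the output of print(x) different in Python 2 and 3, while the output of x is the same?

Because repr(x) gives essentially the same thing on Python 2 and 3 (Python 3 only adds the b prefix), whereas print(x) writes the raw bytes in Python 2 and the repr()-style text in Python 3.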

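Question 4: Why is the output of x.decode('ISO-8859-1') different in Python 2 and 3, while the output of print is the same?

Because x.decode('ISO-8859-1') produces a unicode object in Python 2 and a str object in Python 3, and their __repr__() methods differ in how they display non-ASCII characters: Python 2 escapes them (\xa9), while Python 3 shows printable ones as they are (©). print outputs the text itself rather than its repr(), so it looks the same in both.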
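
To see the two repr styles side by side without a Python 2 interpreter, one option is Python 3's built-in ascii(), which escapes non-ASCII characters in roughly the way Python 2's repr() did (minus the u prefix). A minimal sketch with a shortened sample value:

```python
text = b'\xa9 2020'.decode('ISO-8859-1')

py3_style = repr(text)    # keeps the printable non-ASCII character as-is
py2_style = ascii(text)   # escapes it, similar to repr() in Python 2

print(py2_style)               # '\xa9 2020'
print(py3_style == py2_style)  # False
```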


If you want a more thorough read on all of this, check out Unicode & Character Encodings in Python: A Painless Guide. (Disclosure: I wrote it.)

Upvotes: 1
