difference between chr() and bytes.decode

Question

Can someone explain why I can convert a bytes object to a str via

>>> bytes_ = b';\xf7\xb8W\xef\x0f\xf4V'
>>> list(bytes_)
[59, 247, 184, 87, 239, 15, 244, 86]
>>> "".join([chr(x) for x in bytes_])
';÷¸Wï\x0fôV'

But if I call

>>> bytes_.decode()
Traceback (most recent call last):
  File "", line 1, in 
    bytes_.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf7 in position 1: invalid start byte

I get an error.

DYZ · Accepted Answer

The default encoding used by .decode() is UTF-8, and bytes_ is not valid UTF-8: byte 0xf7 at position 1 can never start a UTF-8 sequence, which is exactly what the error reports. chr(n), on the other hand, does no decoding at all; it simply returns the Unicode character whose code point is n. If you want .decode() to work, you must tell it which encoding to use. For example, utf-16 decodes without an error, although the result is unlikely to be meaningful text:

bytes_.decode('utf-16')
#'\uf73b垸\u0fef围'
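To see why UTF-16 accepts these bytes, here is a minimal sketch. It assumes little-endian byte order (which is what 'utf-16' falls back to on most machines when there is no BOM), so it uses 'utf-16-le' explicitly. The name units is only for illustration.

```python
import struct

bytes_ = b';\xf7\xb8W\xef\x0f\xf4V'

# Read the 8 bytes as four little-endian 16-bit code units.
units = struct.unpack('<4H', bytes_)
print([hex(u) for u in units])  # ['0xf73b', '0x57b8', '0xfef', '0x56f4']

# None of these values is a surrogate (0xD800-0xDFFF), so each code unit
# maps directly to one code point and decoding succeeds.
decoded = bytes_.decode('utf-16-le')
assert decoded == ''.join(chr(u) for u in units)
```

So UTF-16 "works" here only because any pair of bytes that is not a surrogate is a valid code unit, not because the data was ever UTF-16 text.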

CP1252 works, too, but (as expected) gives a different result from UTF-16:

bytes_.decode('cp1252')
#';÷¸Wï\x0fôV'
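The CP1252 result happens to match the chr() version from the question. As a rough sketch of why: mapping every byte value n to chr(n) is what the latin-1 (ISO-8859-1) codec does, and CP1252 only differs from latin-1 for bytes 0x80-0x9F, none of which occur in bytes_. The variable names below are illustrative.

```python
bytes_ = b';\xf7\xb8W\xef\x0f\xf4V'

# chr() on each byte value is the same mapping that latin-1 uses.
via_chr = "".join(chr(x) for x in bytes_)
via_latin1 = bytes_.decode('latin-1')
assert via_chr == via_latin1

# For this particular input, cp1252 agrees with latin-1 as well.
assert bytes_.decode('cp1252') == via_latin1

# latin-1 round-trips back to the original bytes ...
assert via_latin1.encode('latin-1') == bytes_

# ... whereas encoding the same string as UTF-8 yields different bytes.
assert via_chr.encode('utf-8') != bytes_
```

In other words, the chr() approach is an implicit latin-1 decode. Whether that is the right interpretation depends on what actually produced the bytes.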
