What is the point of .decode()

Question

>>> infile = urllib.request.urlopen("http://www.yahoo.com")

With decoding:

>>> infile.read(100).decode()

'

Martijn Pieters · Accepted Answer

No, the output is not the same; one is a Unicode value, the other an undecoded bytes value.

For ASCII, the two look the same, but when you load any web page that uses characters outside the ASCII character set, the difference becomes much clearer.

Take UTF-8 encoded data, for example:

>>> '–'
'–'
>>> '–'.encode('utf8')
b'\xe2\x80\x93'

That's a simple U+2013 EN DASH character. The bytes representation shows the 3 bytes UTF-8 uses to encode the codepoint.
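To make the difference concrete, here is a minimal sketch that round-trips the same EN DASH and compares the two types. The variable names are only illustrative, and the `latin-1` decode is included purely to show what happens when the wrong codec is assumed:

```python
# One Unicode code point (U+2013 EN DASH), written as an escape for clarity
text = '\u2013'

# Encoding produces a bytes object: three bytes in UTF-8
data = text.encode('utf8')
print(len(text), len(data))

# decode() reverses encode(); in Python 3 the default codec is UTF-8
decoded = data.decode()
print(decoded == text)

# Even for pure ASCII, bytes and str are different types and never compare equal
ascii_bytes = b'abc'
print(ascii_bytes == 'abc', ascii_bytes.decode() == 'abc')

# Decoding with the wrong codec does not fail here, but yields the wrong text
wrong = data.decode('latin-1')
print(len(wrong), wrong == text)
```

So `.decode()` is what turns the raw bytes returned by `read()` into text you can compare, slice, and search by character rather than by byte.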

You really want to read up on Unicode vs. encoded data here. I recommend:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
