Reputation: 3
I'm new to Python 3, and I can't quite grasp Unicode and character encoding.
I'm working with the output of another tool that returns the content of an HTML page as a bytes object. Other tools we use need this output as bytes, but I'd like to convert it to a string for some parsing and comparison to other strings. For the cases I'm interested in, printing the bytes object shows only characters and no \x or \u escapes. I'm a little confused about how best to do this, and about why the methods that produce the desired output actually work.
I've read elsewhere that .decode() should be used in this context and this does work, but I don't understand why I am decoding an object that is already characters. From what I understand, decoding is intended for binary numbers, for example:
>>> b'\x41'.decode('utf-8')
'A'
In my understanding, all I really want to do is tell Python that an object that's been labeled as a bytes type object is actually a str object. Simply using the str() function on the bytes object also accomplishes this goal, but adds the "b" prefix and adds quotations around the string.
Here are the two solutions I'm working with:
>>> str(b'htmltext')
"b'htmltext'"
>>> b'htmltext'.decode('utf-8')
'htmltext'
Essentially, either of these solutions appears to achieve what I'm looking for, but the decode() obviously seems cleaner and, from what I've read, the recommended method. I'm wondering why decode() works, given that, apparently, I'm not converting binary numbers to characters. Furthermore, is there any reason other than the unappealing "b" and quotation marks in the output that str() would not be a valid solution here?
Upvotes: 0
Views: 4896
Reputation: 1121276
Don't confuse the developer-friendly representation of the bytes object with the data that is contained in it. You have binary data either way.
The developer representation makes it easy for you to see what is contained by showing any byte that happens to be a printable ASCII codepoint as that ASCII character, rather than as a \xhh escape code. Text encoded as ASCII is simply easier to read that way, and a lot of the world's text happens to be ASCII encoded.
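To make that concrete, here is a minimal sketch (the variable names are only for illustration) showing that a bytes object is a sequence of integers, and that b'A' and b'\x41' are the same value written two different ways:

```python
data = b'htmltext'

# Iterating over (or indexing) a bytes object yields integers, not characters.
byte_values = list(data)
print(byte_values)  # [104, 116, 109, 108, 116, 101, 120, 116]

# The literal b'A' is just a readable spelling of the byte 0x41.
same = b'\x41' == b'A'
print(same)  # True

# Decoding maps those integers to text according to a codec.
text = data.decode('ascii')
print(text)  # htmltext
```

So even when the representation shows only letters, decode() is still converting numbers to characters.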
You'll have a harder time when the data is not within the ASCII range, however:
>>> 'Åæøéï'.encode('utf8')
b'\xc3\x85\xc3\xa6\xc3\xb8\xc3\xa9\xc3\xaf'
That's a UTF-8 byte sequence encoding text with accents. The above may be a little bit contrived, but most non-English text will include some non-ASCII characters. Even English text can contain em-dashes or fancy quotes, and the b'...' bytes version of that is not nearly as readable as the properly decoded text version:
>>> '“Kragerø” is a town in Norway – in the province of Vestfold'.encode('utf8')
b'\xe2\x80\x9cKrager\xc3\xb8\xe2\x80\x9d is a town in Norway \xe2\x80\x93 in the province of Vestfold'
Note that the b'....' output is the result of using the repr() function on a bytes object; that calls the object's __repr__() method, which has the explicit function of producing a developer-friendly string for you. There is no dedicated __str__() method on a bytes object, so the __repr__ method is called instead, even when you use the str() function. The proper way to convert a bytes value to a string is to decode it (using the correct codec for the data).
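A short sketch of the difference, using a made-up sample value rather than real HTML. It assumes the data is UTF-8; the Latin-1 line is there only to show what picking the wrong codec does:

```python
data = 'Kragerø'.encode('utf-8')

# str() without an encoding falls back to repr(): the result is a string
# that *describes* the bytes, including the b prefix and escape codes.
as_repr = str(data)
print(as_repr)  # b'Krager\xc3\xb8'

# decode() interprets the bytes as text using the given codec.
decoded = data.decode('utf-8')
print(decoded)  # Kragerø

# Passing an encoding to str() decodes as well.
also_decoded = str(data, 'utf-8')

# The wrong codec may not raise an error, but it yields the wrong text.
wrong = data.decode('latin-1')
print(wrong)  # KragerÃ¸
```

Note that str(data) only looks right for pure ASCII content, and even then you would have to strip the b'...' wrapper yourself. Running Python with the -b flag also makes it warn about str() calls on bytes without an encoding.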
Of course, when you have binary data that represents something else, like, say, image data, then keep it as bytes. There is no text to decode there.
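As an illustration, the first eight bytes of a PNG file are not valid UTF-8, so trying to decode them fails. This is only a sketch: the signature bytes are the standard PNG header, and the variable names are hypothetical.

```python
# The fixed 8-byte signature that every PNG file starts with.
png_signature = b'\x89PNG\r\n\x1a\n'

try:
    png_signature.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError as exc:
    # 0x89 cannot start a UTF-8 sequence, so decoding raises.
    decode_failed = True
    print(exc)

print(decode_failed)  # True
```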
Upvotes: 5