Reputation: 12574
What function can I apply to a string variable that will cause the same result as prepending the b modifier to a string literal?
I've read in this question about the b modifier for string literals in Python 2 that prepending b to a string makes it a byte string (mainly for compatibility between Python 2 and Python 3 when using 2to3). The result I would like to obtain is the same, but applied to a variable, like so:
def is_binary_string_equal(string_variable):
    binary_string = b'this is binary'
    return convert_to_binary(string_variable) == binary_string
>>> is_binary_string_equal('this is binary')
True
What is the correct definition of convert_to_binary?
Upvotes: 3
Views: 3090
Reputation: 89
Note that in Python 3.7, running on a Linux machine, calling .encode('UTF-8') on a string and writing a b'string' literal do not always give the same result.
This caused a lot of pain in a project of mine, and to this day I have no clear understanding of why it happens, but running this in Python 3.7:
print('\xAD\x43\x48\x49\x44\x44\x49\x4E\x47\x53\x54\x4F\x4E\x45'.encode('UTF-8'))
print(b'\xAD\x43\x48\x49\x44\x44\x49\x4E\x47\x53\x54\x4F\x4E\x45')
prints this to the console:
b'\xc2\xadCHIDDINGSTONE'
b'\xadCHIDDINGSTONE'
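One way to read the two outputs above, sketched here under the assumption that this is Python 3: in a str literal, \xAD names the code point U+00AD, which UTF-8 encodes as the two bytes C2 AD, while in a bytes literal \xAD is the single byte 0xAD. Encoding the same text as Latin-1, where each code point below 256 maps to one byte, should give back the bytes literal. The variable names are only for illustration.

```python
# Python 3 assumed: plain literals are str (Unicode), b'' literals are bytes.
text = '\xadCHIDDINGSTONE'   # first character is the code point U+00AD
raw = b'\xadCHIDDINGSTONE'   # first element is the single byte 0xAD

utf8_encoded = text.encode('utf-8')      # U+00AD becomes two bytes: C2 AD
latin1_encoded = text.encode('latin-1')  # U+00AD becomes one byte: AD

print(utf8_encoded == raw)    # the UTF-8 form is one byte longer
print(latin1_encoded == raw)  # the Latin-1 form matches the bytes literal
```

So the difference comes from what the escape means in each kind of literal, not from the operating system.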
Upvotes: 0
Reputation: 366123
First, note that in Python 2.x, the b prefix actually does nothing. b'foo' and 'foo' are both exactly the same string literal. The b only exists to allow you to write code that's compatible with both Python 2.x and Python 3.x: you can use b'foo' to mean "I want bytes in both versions", and u'foo' to mean "I want Unicode in both versions", and just plain 'foo' to mean "I want the default str type in both versions, even though that's Unicode in 3.x and bytes in 2.x".
So, "the functional equivalent of prepending the 'b' character to a string literal in Python 2" is literally doing nothing at all.
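As a rough illustration of the three literal forms, the following sketch checks the types at runtime. It assumes an interpreter that accepts all three prefixes (Python 2.6+ or 3.3+); the variable names are made up for the example.

```python
import sys

byte_literal = b'foo'
plain_literal = 'foo'
unicode_literal = u'foo'

if sys.version_info[0] >= 3:
    # Python 3: plain and u'' literals are both Unicode str; b'' is bytes.
    same_as_plain = type(unicode_literal) is type(plain_literal)
else:
    # Python 2: plain and b'' literals are both the byte-oriented str.
    same_as_plain = type(byte_literal) is type(plain_literal)

print(same_as_plain)
print(type(byte_literal), type(plain_literal), type(unicode_literal))
```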
But let's assume that you actually have a Unicode string (like what you get out of a plain literal or a text file in Python 3, even though in Python 2 you can only get these by explicitly decoding, or using some function that does it for you, like opening a file with codecs.open). Because then it's an interesting question.
The short answer is: string_variable.encode(encoding).
But before you can do that, you need to know what encoding you want. You don't need that with a literal string, because when you use the b prefix in your source code, Python knows what encoding you want: the same encoding as your source code file.* But everything other than your source code (files you open and read, input the user types, messages coming in over a socket) could be anything, and Python has no idea; you have to tell it.**
In many cases (especially if you're on a reasonably recent non-Windows machine and dealing with local data), it's safe to assume that the answer is UTF-8, so you can spell convert_to_binary(string_variable) as string_variable.encode('utf8'). But "many" isn't "all".*** This is why text editors and web browsers let the user select an encoding: sometimes only the user actually knows.
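Putting that together, a minimal sketch of the function the question asks for might look like this. The encoding parameter and its UTF-8 default are assumptions added for illustration; they are not part of the question's original signature.

```python
def convert_to_binary(string_variable, encoding='utf-8'):
    """Encode a Unicode string to bytes.

    The UTF-8 default is an assumption; pass the real encoding if you know it.
    """
    return string_variable.encode(encoding)


def is_binary_string_equal(string_variable):
    binary_string = b'this is binary'
    return convert_to_binary(string_variable) == binary_string


print(is_binary_string_equal(u'this is binary'))
```

For pure-ASCII text like this, UTF-8, Latin-1 and cp1252 all produce the same bytes, so the choice only starts to matter once non-ASCII characters appear.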
* See PEP 263 for how you can specify the encoding, and why you'd want to.
** In Python 3 you can also use bytes(s, encoding), which is equivalent to s.encode(encoding). With s.encode() you can leave off the encoding argument, but then the default is ASCII in Python 2 and UTF-8 in Python 3, which may not be what you actually wanted, so spell it out.
*** For example, many older network protocols are defined as Latin-1. Many Windows text files are created in whatever the ANSI code page is set to (usually cp1252 on American systems, but there are many other possibilities). Sometimes sys.getdefaultencoding() or locale.getpreferredencoding() gets what you want, but that obviously doesn't work when, say, you're processing a file that someone uploaded that's in their machine's preferred encoding, not yours.
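To make the footnotes concrete, here is a small sketch (Python 3 assumed, sample text invented for the example) showing that the same Unicode string turns into different bytes, or fails to encode at all, depending on the encoding you name.

```python
text = u'caf\u00e9 \u20ac5'  # "café €5": one accented letter and a euro sign

utf8_bytes = text.encode('utf-8')     # multi-byte sequences for both characters
cp1252_bytes = text.encode('cp1252')  # one byte per character
also_utf8 = bytes(text, 'utf-8')      # Python 3 only: same as text.encode('utf-8')

try:
    text.encode('latin-1')            # Latin-1 has no euro sign
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(utf8_bytes)
print(cp1252_bytes)
print(latin1_failed)
```

Decoding has the mirror-image problem: bytes written as cp1252 will not round-trip correctly if read back as UTF-8, which is why the encoding has to be known rather than guessed.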
In the special case where the relevant encoding is "whatever this particular source file is in", you pretty much have to know that somehow out-of-band.* Once a script or module has been compiled and loaded, it's no longer possible to tell what encoding it was originally in.**
But there shouldn't be much reason to want that. After all, if two binary strings are equal, and in the same encoding, the Unicode strings are also equal, and vice-versa, so you could just write your code as:
def is_binary_string_equal(string_variable):
    binary_string = u'this is binary'
    return string_variable == binary_string
* The default is, of course, documented—it's UTF-8 for 3.0, ASCII or Latin-1 for 2.x depending on your version. But you can override that, as PEP 263 explains.
** Well, you could use the inspect module to find the source, then the importlib module to start processing it, etc., but that only works if the file is still there and hasn't been edited since you last compiled it.
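If the source file is still on disk, one possible shortcut (Python 3, and a different route from the inspect/importlib approach mentioned above) is the standard library's tokenize.detect_encoding, which reads a BOM or PEP 263 coding cookie and otherwise falls back to UTF-8. The helper name and the throwaway demo file below are hypothetical.

```python
import os
import tempfile
import tokenize


def detect_source_encoding(path):
    """Return the encoding declared by the Python source file at path."""
    with open(path, 'rb') as source:
        # Looks at a BOM or a coding cookie in the first two lines;
        # falls back to UTF-8 when neither is present.
        encoding, _first_lines = tokenize.detect_encoding(source.readline)
    return encoding


# Demo on a temporary file that declares cp1252 (file name is illustrative).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo_module.py')
    with open(demo_path, 'wb') as handle:
        handle.write(b"# -*- coding: cp1252 -*-\nname = 'demo'\n")
    declared = detect_source_encoding(demo_path)

print(declared)
```

For an already imported module you could pass inspect.getsourcefile(module) as the path, with the same caveat as above: the answer describes the file as it is now, not as it was when it was compiled.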
Upvotes: 5