Reputation: 2697
I am in need of a way to get the binary representation of a string in python. e.g.
st = "hello world"
toBinary(st)
Is there a module of some neat way of doing this?
Upvotes: 179
Views: 638577
Reputation: 19644
Here is a comparison of bit lengths in various encodings of ASCII 127 (delete). Note the respective 24, 16, and 32 bit byte order mark (BOM) in UTF-8-SIG, UTF-16, and UTF-32:
>>> for encoding in ('utf-8', 'utf-8-sig', 'utf-16', 'utf-16-le', 'utf-16-be', 'utf-32', 'utf-32-le', 'utf-32-be'): print(''.join(' '.join((f'{encoding:9}', f'{len(bs):2}', bs)) for bs in [''.join(f'{byte:08b}' for byte in '\x7f'.encode(encoding))]))
...
utf-8 8 01111111
utf-8-sig 32 11101111101110111011111101111111
utf-16 32 11111111111111100111111100000000
utf-16-le 16 0111111100000000
utf-16-be 16 0000000001111111
utf-32 64 1111111111111110000000000000000001111111000000000000000000000000
utf-32-le 32 01111111000000000000000000000000
utf-32-be 32 00000000000000000000000001111111
Upvotes: 0
Reputation: 23
''.join(format(i, 'b') for i in bytearray(str, encoding='utf-8'))
This works okay since its easy to now revert back to the string as no zeros will be added to reach the 8 bits to form a byte hence easy to revert to string to avoid complexity of removing the zeros added.
Upvotes: 0
Reputation: 107287
If by binary you mean bytes
type, you can just use encode
method of the string object that encodes your string as a bytes object using the passed encoding type. You just need to make sure you pass a proper encoding to encode
function.
In [9]: "hello world".encode('ascii')
Out[9]: b'hello world'
In [10]: byte_obj = "hello world".encode('ascii')
In [11]: byte_obj
Out[11]: b'hello world'
In [12]: byte_obj[0]
Out[12]: 104
Otherwise, if you want them in form of zeros and ones --binary representation-- as a more pythonic way you can first convert your string to byte array then use bin
function within map
:
>>> st = "hello world"
>>> map(bin,bytearray(st))
['0b1101000', '0b1100101', '0b1101100', '0b1101100', '0b1101111', '0b100000', '0b1110111', '0b1101111', '0b1110010', '0b1101100', '0b1100100']
Or you can join it:
>>> ' '.join(map(bin,bytearray(st)))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'
Note that in python3 you need to specify an encoding for bytearray
function :
>>> ' '.join(map(bin,bytearray(st,'utf8')))
'0b1101000 0b1100101 0b1101100 0b1101100 0b1101111 0b100000 0b1110111 0b1101111 0b1110010 0b1101100 0b1100100'
You can also use binascii
module in python 2:
>>> import binascii
>>> bin(int(binascii.hexlify(st),16))
'0b110100001100101011011000110110001101111001000000111011101101111011100100110110001100100'
hexlify
return the hexadecimal representation of the binary data then you can convert to int by specifying 16 as its base then convert it to binary with bin
.
Upvotes: 131
Reputation: 250961
Something like this?
>>> st = "hello world"
>>> ' '.join(format(ord(x), 'b') for x in st)
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'
#using `bytearray`
>>> ' '.join(format(x, 'b') for x in bytearray(st, 'utf-8'))
'1101000 1100101 1101100 1101100 1101111 100000 1110111 1101111 1110010 1101100 1100100'
Upvotes: 169
Reputation: 89587
In Python version 3.6 and above you can use f-string to format result.
str = "hello world"
print(" ".join(f"{ord(i):08b}" for i in str))
01101000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100
The left side of the colon, ord(i), is the actual object whose value will be formatted and inserted into the output. Using ord() gives you the base-10 code point for a single str character.
The right hand side of the colon is the format specifier. 08 means width 8, 0 padded, and the b functions as a sign to output the resulting number in base 2 (binary).
Upvotes: 9
Reputation: 1
a = list(input("Enter a string\t: "))
def fun(a):
c =' '.join(['0'*(8-len(bin(ord(i))[2:]))+(bin(ord(i))[2:]) for i in a])
return c
print(fun(a))
Upvotes: -2
Reputation: 2269
def method_a(sample_string):
binary = ' '.join(format(ord(x), 'b') for x in sample_string)
def method_b(sample_string):
binary = ' '.join(map(bin,bytearray(sample_string,encoding='utf-8')))
if __name__ == '__main__':
from timeit import timeit
sample_string = 'Convert this ascii strong to binary.'
print(
timeit(f'method_a("{sample_string}")',setup='from __main__ import method_a'),
timeit(f'method_b("{sample_string}")',setup='from __main__ import method_b')
)
# 9.564299999998184 2.943955828988692
method_b is substantially more efficient at converting to a byte array because it makes low level function calls instead of manually transforming every character to an integer, and then converting that integer into its binary value.
Upvotes: 3
Reputation: 22754
This is an update for the existing answers which used bytearray()
and can not work that way anymore:
>>> st = "hello world"
>>> map(bin, bytearray(st))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: string argument without an encoding
Because, as explained in the link above, if the source is a string, you must also give the encoding:
>>> map(bin, bytearray(st, encoding='utf-8'))
<map object at 0x7f14dfb1ff28>
Upvotes: 2
Reputation: 1302
You can access the code values for the characters in your string using the ord()
built-in function. If you then need to format this in binary, the string.format()
method will do the job.
a = "test"
print(' '.join(format(ord(x), 'b') for x in a))
(Thanks to Ashwini Chaudhary for posting that code snippet.)
While the above code works in Python 3, this matter gets more complicated if you're assuming any encoding other than UTF-8. In Python 2, strings are byte sequences, and ASCII encoding is assumed by default. In Python 3, strings are assumed to be Unicode, and there's a separate bytes
type that acts more like a Python 2 string. If you wish to assume any encoding other than UTF-8, you'll need to specify the encoding.
In Python 3, then, you can do something like this:
a = "test"
a_bytes = bytes(a, "ascii")
print(' '.join(["{0:b}".format(x) for x in a_bytes]))
The differences between UTF-8 and ascii encoding won't be obvious for simple alphanumeric strings, but will become important if you're processing text that includes characters not in the ascii character set.
Upvotes: 16