jatrp5

Reputation: 31

How to turn a binary string into a byte?

If I take the letter 'à' and encode it in UTF-8 I obtain the following result:

'à'.encode('utf-8')
>> b'\xc3\xa0'

Now I would like to convert 'à' into a binary string by way of a bytearray, and then turn that string back into 'à'. To do so, I run the following code:

byte = bytearray('à','utf-8')
for x in byte:
    print(bin(x))

I get 0b11000011 and 0b10100000, which are 195 and 160. Then I join them together and strip the 0b prefixes. Now I run this code:

s = '1100001110100000'
value1 =  s[0:8].encode('utf-8')
value2 =  s[9:16].encode('utf-8')
value = value1 + value2
print(chr(int(value, 2)))
>> 憠

No matter how I change the later part, I get other symbols and never manage to get my 'à' back. Why is that, and how can I get an 'à'?

Upvotes: 1

Views: 1867

Answers (3)

Mark Tolonen

Reputation: 177735

Convert the base-2 string back to an integer with int(s, 2), then convert that integer to bytes with int.to_bytes, using the original length divided by 8 as the byte count and big-endian order to keep the bytes in the right order. Finally, .decode() the result (the default encoding in Python 3 is UTF-8):

>>> s = '1100001110100000'
>>> int(s,2)
50080
>>> int(s,2).to_bytes(len(s)//8,'big')
b'\xc3\xa0'
>>> int(s,2).to_bytes(len(s)//8,'big').decode()
'à'
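For the full round trip, here is a minimal sketch of the same idea wrapped in two helpers. The names text_to_bits and bits_to_text are made up for illustration, and it assumes the bit string is zero-padded to 8 bits per byte so that leading zero bits survive:

```python
def text_to_bits(text, encoding="utf-8"):
    """Encode text and return its bytes as one zero-padded bit string."""
    data = text.encode(encoding)
    # Pad to 8 bits per byte; otherwise leading zero bits would be dropped.
    return format(int.from_bytes(data, "big"), "0{}b".format(len(data) * 8))


def bits_to_text(bits, encoding="utf-8"):
    """Reverse text_to_bits: bit string -> integer -> bytes -> text."""
    return int(bits, 2).to_bytes(len(bits) // 8, "big").decode(encoding)


bits = text_to_bits("à")
print(bits)                # 1100001110100000
print(bits_to_text(bits))  # à
```

The padding matters for characters such as 'a' (0x61), whose first bit is 0; without it the length would no longer be a multiple of 8.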

Upvotes: 0

Joran Beasley

Reputation: 113988

You need the second slice to be s[8:16] (or just s[8:]); s[9:16] skips one bit and leaves you with the 7 characters 0100000.

You also need to convert your "bit string" back to an integer before treating it as a byte, with something like int("0010101", 2).

s = '1100001110100000'
value1 = bytearray([
    int(s[:8], 2),  # bits 0..7 (8 total)
    int(s[8:], 2),  # bits 8..15 (8 total)
])
print(value1.decode("utf8"))
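To see what went wrong in the question's attempt, here is a small sketch that reproduces it. The variable name wrong is only for illustration. It shows that s[9:16] is one bit short, and that chr() treats its argument as a Unicode code point rather than as UTF-8 bytes:

```python
s = '1100001110100000'

# The slice in the question starts one position too late.
print(s[9:16], len(s[9:16]))  # 0100000 7
print(s[8:16], len(s[8:16]))  # 10100000 8

# The question joined 8 + 7 characters and parsed all 15 as a single number.
wrong = int(s[0:8] + s[9:16], 2)
print(hex(wrong))   # 0x61a0
print(chr(wrong))   # the stray character from the question

# Even with all 16 bits, chr() looks up code point 50080, which is not 'à'.
print(chr(int(s, 2)) == 'à')  # False
```

So both fixes are needed: the slice has to cover 8 bits, and the two values have to be decoded as UTF-8 bytes instead of being passed to chr().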

Upvotes: 0

Mark Ransom

Reputation: 308206

>>> s = '1100001110100000'
>>> bytes(int(s[i:i+8], 2) for i in range(0, len(s), 8)).decode('utf-8')
'à'

There are multiple parts to this. The bytes constructor creates a byte string from a sequence of integers. The integers are formed from strings using int with a base of 2. The range combined with the slicing peels off 8 characters at a time. Finally decode converts those bytes back into Unicode characters.
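The same steps can be spelled out one at a time. This is just the one-liner unrolled, with intermediate names (chunks, values, raw) chosen for illustration:

```python
s = '1100001110100000'

# range + slicing: peel off 8 characters at a time
chunks = [s[i:i + 8] for i in range(0, len(s), 8)]
print(chunks)  # ['11000011', '10100000']

# int with base 2: turn each 8-character chunk into an integer
values = [int(chunk, 2) for chunk in chunks]
print(values)  # [195, 160]

# bytes constructor: build a byte string from the integers
raw = bytes(values)
print(raw)     # b'\xc3\xa0'

# decode: interpret the bytes as UTF-8 text
print(raw.decode('utf-8'))  # à
```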

Upvotes: 3
