Reputation: 31
If I take the letter 'à' and encode it in UTF-8 I obtain the following result:
'à'.encode('utf-8')
>> b'\xc3\xa0'
Now, starting from a bytearray, I would like to convert 'à' into a binary string and then turn it back into 'à'. To do so I execute the following code:
byte = bytearray('à', 'utf-8')
for x in byte:
    print(bin(x))
I get 0b11000011 and 0b10100000, which are 195 and 160. Then I fuse them together and strip off the 0b prefixes. Now I execute this code:
s = '1100001110100000'
value1 = s[0:8].encode('utf-8')
value2 = s[9:16].encode('utf-8')
value = value1 + value2
print(chr(int(value, 2)))
>> 憠
No matter how I rework the last part, I get other symbols and never seem to get my 'à' back. Why is that, and how can I get 'à' again?
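For reference, the fused bit string can also be built in code rather than by hand (a small sketch; format(x, '08b') zero-pads each byte to 8 bits, which plain bin() does not do for bytes below 128):

```python
# Build the 16-bit string from the UTF-8 bytes of 'à' instead of
# copying the bin() output by hand.
encoded = 'à'.encode('utf-8')  # b'\xc3\xa0'
bits = ''.join(format(x, '08b') for x in encoded)
print(bits)  # 1100001110100000
```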
Upvotes: 1
Views: 1867
Reputation: 177735
Convert the base-2 value back to an integer with int(s, 2), convert that integer to a number of bytes (int.to_bytes) based on the original length divided by 8, using big-endian order to keep the bytes in the right order, then .decode() it (the default in Python 3 is utf8):
>>> s = '1100001110100000'
>>> int(s,2)
50080
>>> int(s,2).to_bytes(len(s)//8,'big')
b'\xc3\xa0'
>>> int(s,2).to_bytes(len(s)//8,'big').decode()
'à'
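The same pipeline can be wrapped in a small helper (the name bits_to_text is just for illustration; it assumes the string holds whole UTF-8 bytes):

```python
def bits_to_text(s: str) -> str:
    # Interpret s as a big-endian string of whole 8-bit bytes,
    # then decode those bytes as UTF-8.
    return int(s, 2).to_bytes(len(s) // 8, 'big').decode()

print(bits_to_text('1100001110100000'))  # à
```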
Upvotes: 0
Reputation: 113988
You need your second byte's bits to be s[8:16] (or just s[8:]); otherwise you get 0100000, which is only seven bits. You also need to convert your "bit string" back to an integer, e.g. int("0010101", 2), before thinking of it as a byte.
s = '1100001110100000'
value1 = bytearray([int(s[:8], 2),   # bits 0..7  (8 total)
                    int(s[8:], 2)])  # bits 8..15 (8 total)
print(value1.decode("utf8"))
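To see why the slice matters, a quick check (assuming the same s as above): the original s[9:16] silently drops bit 8 and yields the wrong value.

```python
s = '1100001110100000'
# s[9:16] skips index 8, leaving only seven bits:
print(int(s[9:16], 2))  # 32, not the intended byte
# s[8:] keeps all eight bits of the second byte:
print(int(s[8:], 2))    # 160
```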
Upvotes: 0
Reputation: 308206
>>> bytes(int(s[i:i+8], 2) for i in range(0, len(s), 8)).decode('utf-8')
'à'
There are multiple parts to this. The bytes constructor creates a byte string from a sequence of integers. The integers are formed from strings using int with a base of 2. The range combined with the slicing peels off 8 characters at a time. Finally, decode converts those bytes back into Unicode characters.
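Going the other direction with the same 8-bits-at-a-time idea gives a full round trip (a sketch; f'{b:08b}' zero-pads each byte to 8 bits):

```python
text = 'à'
# Encode to UTF-8 bytes, then render each byte as 8 binary digits.
bits = ''.join(f'{b:08b}' for b in text.encode('utf-8'))
print(bits)  # 1100001110100000
# Peel off 8 bits at a time and decode back to the original string.
restored = bytes(int(bits[i:i+8], 2)
                 for i in range(0, len(bits), 8)).decode('utf-8')
print(restored)  # à
```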
Upvotes: 3