Reputation: 5
I'm trying to write a function that takes a decimal integer and converts it to a hexadecimal escape sequence with the bytes in reverse order. The code I wrote works for most numbers, like the one in the example, but for some values it adds an extra byte, \xC2 or \xC3, at the start. I assume this is because of how UTF-8 works, but I need the result to be exactly 4 bytes. From testing, it seems to happen for every other block of 128 numbers, with the extra byte switching to \xC3 halfway through the block.
I could systematically remove the extra byte, but that seems arbitrary, and there has to be a better way to do this. So what is the reason behind this extra byte, and how can I prevent it from happening? Or should I just strip it out?
Here is the function I wrote:
import re

def convert_int_to_reverse_hex_escape_sequence(decimal):
    # Example values are shown in the comments, for decimal = 275
    hexadecimal = hex(decimal)  # 0x113
    padded = hexadecimal[2:].zfill(8)  # 00000113
    array = re.findall('..', padded)  # ['00', '00', '01', '13']
    array.reverse()  # ['13', '01', '00', '00']
    unicode = ''.join([chr(int(x, 16)) for x in array]).encode('utf-8')  # b'\x13\x01\x00\x00'
    return unicode
Upvotes: 0
Views: 1034
Reputation: 177901
UTF-8 encodes any Unicode code point above 127 (0x7F) in two or more bytes, so whenever chr(int(x, 16)) yields a code point of 128 (0x80) or higher, you will see your issue:
>>> ''.join(chr(int(x,16)) for x in ['80','90','A0','B0']).encode('utf8')
b'\xc2\x80\xc2\x90\xc2\xa0\xc2\xb0'
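To illustrate why the extra byte is specifically \xC2 or \xC3, here is a minimal sketch that builds the two-byte UTF-8 form by hand. The helper name `utf8_two_byte` is made up for this illustration; it only covers code points 0x80 through 0x7FF, which is the range that matters here.

```python
def utf8_two_byte(code_point):
    """Build the two-byte UTF-8 encoding of a code point in 0x80..0x7FF.

    Hypothetical helper for illustration only.
    """
    lead = 0xC0 | (code_point >> 6)     # 110xxxxx: the upper bits
    trail = 0x80 | (code_point & 0x3F)  # 10xxxxxx: the low 6 bits
    return bytes([lead, trail])

# 0x80..0xBF get a \xC2 lead byte, 0xC0..0xFF get \xC3.
for cp in (0x80, 0xBF, 0xC0, 0xFF):
    # The hand-built bytes agree with Python's own UTF-8 encoder.
    assert utf8_two_byte(cp) == chr(cp).encode('utf-8')
    print(hex(cp), utf8_two_byte(cp))
```

This matches the pattern described in the question: byte values 128-191 pick up a \xC2 prefix and byte values 192-255 pick up \xC3.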
latin1 would do what you want, as it converts code points 0-255 to bytes 0-255 on a 1:1 basis:
>>> ''.join(chr(int(x,16)) for x in ['80','90','A0','B0']).encode('latin1')
b'\x80\x90\xa0\xb0'
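If you want to keep the hex-string approach, another option is to skip the text encoding step entirely and build the bytes object straight from the integer values. This is a sketch, not the original function; the name `hex_pairs_to_bytes` is invented for the example.

```python
def hex_pairs_to_bytes(pairs):
    """Turn a list of two-digit hex strings into a bytes object.

    Hypothetical helper: no str/encode round trip, so each pair
    always becomes exactly one byte.
    """
    return bytes(int(pair, 16) for pair in pairs)

result = hex_pairs_to_bytes(['80', '90', 'A0', 'B0'])
# Same result as the latin1 encoding, and exactly 4 bytes long.
assert result == ''.join(chr(int(x, 16)) for x in ['80', '90', 'A0', 'B0']).encode('latin1')
assert len(result) == 4
```

Since no encoding is involved, there is nothing that could expand a value into more than one byte.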
But there is a built-in method for your use case, int.to_bytes. Tell it how many bytes you want and whether you want little- or big-endian order:
>>> x = 275
>>> x.to_bytes(4,'little')
b'\x13\x01\x00\x00'
Upvotes: 1