python randomly adds bytes to a string when encoding to utf-8

Question

I'm trying to write a function that takes a decimal integer and converts it to a hexadecimal escape sequence in reverse byte order. The code I wrote works for most numbers, like the one in the example, but for some inputs it adds an extra byte, \xC2 or \xC3, at the start. I assume this is because of how UTF-8 works, but I need the result to be exactly 4 bytes. From testing, it seems to happen for every other block of 128 numbers, switching to \xC3 halfway through each affected block.

I could systematically strip the extra byte, but that feels like a hack, and there has to be a better way to do this. So what is the reason behind the extra byte, and how can I prevent it from appearing? Or should I just strip it?

Here is the function I wrote:

import re

def convert_int_to_reverse_hex_escape_sequence(decimal):
    # example of the variable in comments                               # decimal = 275
    hexadecimal = hex(decimal)                                          # 0x113
    padded = hexadecimal[2:].zfill(8)                                   # 00000113
    array = re.findall('..', padded)                                    # ['00', '00', '01', '13']
    array.reverse()                                                     # ['13', '01', '00', '00']
    unicode = ''.join([chr(int(x, 16)) for x in array]).encode('utf-8') # b'\x13\x01\x00\x00'
    return unicode

Mark Tolonen · Accepted Answer

UTF-8 encodes any Unicode code point above 127 (0x7F) in two or more bytes, so whenever chr(int(x,16)) yields a code point of 128 or more you will see your issue:

>>> ''.join(chr(int(x,16)) for x in ['80','90','A0','B0']).encode('utf8')
b'\xc2\x80\xc2\x90\xc2\xa0\xc2\xb0'
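This also accounts for the switch from \xC2 to \xC3 that you noticed. A small sketch that prints the UTF-8 encoding at the boundaries (the sample code points are just ones I picked for illustration): 0x80-0xBF get a \xC2 lead byte, and 0xC0-0xFF get \xC3.

```python
# Code points 0x00-0x7F encode to a single byte in UTF-8.
# Code points 0x80-0xFF encode to two bytes: a lead byte (0xC2 or 0xC3)
# followed by a continuation byte.
boundaries = (0x7F, 0x80, 0xBF, 0xC0, 0xFF)

encodings = {cp: chr(cp).encode('utf-8') for cp in boundaries}

for cp, encoded in encodings.items():
    print(hex(cp), encoded, len(encoded))
```

So any of your four byte values that lands in 0x80-0xFF picks up one extra byte, which matches the "every other 128 numbers" pattern for the lowest byte.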

latin1 would do what you want as it converts characters 0-255 to bytes 0-255 on a 1:1 basis:

>>> ''.join(chr(int(x,16)) for x in ['80','90','A0','B0']).encode('latin1')
b'\x80\x90\xa0\xb0'
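If you want to keep the shape of your original function, one option is to skip the text step entirely and build the bytes object straight from the integers, so no encoding is involved. This is only a sketch; the function name is a hypothetical variation on yours:

```python
import re

def convert_int_to_reverse_hex_bytes(decimal):
    # Same steps as the question's function, but the final step builds
    # bytes directly from integers instead of encoding a str.
    padded = format(decimal, 'x').zfill(8)          # 275 -> '00000113'
    pairs = re.findall('..', padded)                # ['00', '00', '01', '13']
    pairs.reverse()                                 # ['13', '01', '00', '00']
    return bytes(int(pair, 16) for pair in pairs)   # each value must be 0-255

print(convert_int_to_reverse_hex_bytes(275))
print(convert_int_to_reverse_hex_bytes(128))
```

bytes() accepts any iterable of integers in the range 0-255, so values of 0x80 and above stay as single bytes.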

But there is a built-in function for your use case. Tell it how many bytes you want and little- or big-endian:

>>> x = 275
>>> x.to_bytes(4,'little')
b'\x13\x01\x00\x00'
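Wrapped up as a function, that could look like the sketch below. The helper name int_to_le_bytes is made up for illustration; struct.pack with the '<I' format (little-endian unsigned 32-bit) is shown only as an equivalent for the 4-byte case.

```python
import struct

def int_to_le_bytes(value, length=4):
    # Hypothetical helper: little-endian bytes of a non-negative integer.
    # int.to_bytes raises OverflowError if value does not fit in `length` bytes.
    return value.to_bytes(length, 'little')

print(int_to_le_bytes(275))
print(int_to_le_bytes(128))      # no extra 0xC2 byte
print(struct.pack('<I', 275))    # same bytes as int_to_le_bytes(275)
```

The result is always exactly the length you ask for, and values that are too large fail loudly instead of silently growing.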
