hipeople321

Reputation: 122

Custom Base64 Encoder Doesn't Encode Correctly

I decided to make my own Base64 encoder and decoder, despite there already being a module for this in the standard library. It's just meant to be a fun project. However, the encoder, for some reason, incorrectly encodes some characters, and I haven't had luck with debugging. I've tried to follow the model found on Wikipedia to a tee. I believe the problem has to do with the underlying conversion to binary format, but I'm not sure.

Code:

def encode_base64(data):
    raw_bits = ''.join('0' + bin(i)[2:] for i in data)
    # First bit is usually (always??) 0 in ascii characters
    
    split_by_six = [raw_bits[i: i + 6] for i in range(0, len(raw_bits), 6)]
    
    if len(split_by_six[-1]) < 6: # Add extra zeroes if necessary
        split_by_six[-1] = split_by_six[-1] + ((6 - len(split_by_six[-1])) * '0')
    
    padding = 2 if len(split_by_six) % 2 == 0 else 1
    if len(split_by_six) % 4 == 0: # See if padding is necessary
        padding = 0
    
    indexer = ([chr(i) for i in range(65, 91)] # Base64 Table
         + [chr(i) for i in range(97, 123)]
         + [chr(i) for i in range(48, 58)]
         + ['+', '/'])
    
    return ''.join(indexer[int(i, base=2)] for i in split_by_six) + ('=' * padding)

When I run the following sample code, I get an incorrect value, as you can see below:

print(encode_base64(b'any carnal pleasure'))
# OUTPUT: YW55QMbC5NzC2IHBsZWFzdXJl=
# What I should be outputting: YW55IGNhcm5hbCBwbGVhc3VyZQ==

For some odd reason, the first few characters are correct, and then the rest aren't. I am happy to answer any questions!

Upvotes: 3

Views: 114

Answers (1)

water_ghosts

Reputation: 736

Python's bin() function doesn't include leading zeroes, so the length of a binary representation will vary:

>>> bin(1)
'0b1'
>>> bin(255)
'0b11111111'
>>> bin(ord("a"))
'0b1100001'
>>> bin(ord(" "))
'0b100000'

In your input, a, n, and y all have one leading zero in their 8-bit binary representation, so bin(i) gives 7 bits after the 0b prefix for each of them, and prepending a single '0' happens to produce 8 bits. But the 8-bit representation of ' ' has two leading zeroes, so bin(i) is one bit shorter than you expect, and the rest of raw_bits gets misaligned.
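To see the misalignment concretely, here is a small sketch comparing the bit string built the way the question does with one built at a fixed width. The helper names naive_bits and fixed_bits are made up for illustration and are not part of the original code.

```python
def naive_bits(data):
    # Same approach as the question: prepend one '0' to whatever bin() returns
    return ''.join('0' + bin(byte)[2:] for byte in data)


def fixed_bits(data):
    # Always emit exactly 8 bits per byte, zero-padded on the left
    return ''.join(format(byte, '08b') for byte in data)


sample = b'any '
for byte in sample:
    # Show each character next to its naive and fixed-width bit strings
    print(repr(chr(byte)), '0' + bin(byte)[2:], format(byte, '08b'))

print(len(naive_bits(sample)))  # 31: the space only contributed 7 bits
print(len(fixed_bits(sample)))  # 32: four full bytes
```

Everything after the space is shifted by one bit in the naive version, which is why the first four output characters are right and the following ones are not.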

To fix this, pad each byte's binary representation with leading zeroes until it is 8 characters long. Since data is a bytes object, iterating over it already yields integers, so ord() isn't needed. You can use format(i, "#010b")[2:] to make the full representation 10 characters and then discard the 0b prefix, leaving the 8 bits that you care about. More simply, format(i, "08b") produces the 8 zero-padded bits directly.
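Putting that together, a minimal sketch of a corrected encoder might look like the following. It keeps the question's function name but uses fixed-width formatting for every byte. The ALPHABET constant and the modulo-based padding expressions are my own choices rather than anything from the original code, and the comparison against the standard library's base64 module is only there as a sanity check.

```python
import base64
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + '+/'


def encode_base64(data):
    # Fixed-width 8-bit representation of every byte, so nothing gets misaligned
    raw_bits = ''.join(format(byte, '08b') for byte in data)

    # Zero-fill the tail so the bit string divides evenly into 6-bit groups
    raw_bits += '0' * (-len(raw_bits) % 6)

    sextets = [raw_bits[i:i + 6] for i in range(0, len(raw_bits), 6)]
    encoded = ''.join(ALPHABET[int(bits, 2)] for bits in sextets)

    # Pad the output with '=' up to a multiple of 4 characters
    return encoded + '=' * (-len(encoded) % 4)


print(encode_base64(b'any carnal pleasure'))   # YW55IGNhcm5hbCBwbGVhc3VyZQ==
print(encode_base64(b'any carnal pleasure.'))  # YW55IGNhcm5hbCBwbGVhc3VyZS4=

# Sanity check against the standard library
for sample in (b'', b'a', b'ab', b'abc', b'any carnal pleasure'):
    assert encode_base64(sample) == base64.b64encode(sample).decode('ascii')
```

Note that the expected string ending in S4= corresponds to the input with a trailing period. Without the period the correct result ends in ZQ==.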

Upvotes: 2
