Coldchain9

Reputation: 1745

How to get same results from base64 atob in Javascript vs Python

I found some code online that decodes base64, and I am trying to work through it. I know Python has base64.urlsafe_b64decode(), but I would like to learn a bit more about what is going on.

The JS atob looks like:

function atob (input) {
  var chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=';
  var str = String(input).replace(/=+$/, '');
  if (str.length % 4 == 1) {
    throw new InvalidCharacterError("'atob' failed: The string to be decoded is not correctly encoded.");
  }
  for (
    // initialize result and counters
    var bc = 0, bs, buffer, idx = 0, output = '';
    // get next character
    buffer = str.charAt(idx++);
    // character found in table? initialize bit storage and add its ascii value;
    ~buffer && (bs = bc % 4 ? bs * 64 + buffer : buffer,
      // and if not first of each 4 characters,
      // convert the first 8 bits to one ascii character
      bc++ % 4) ? output += String.fromCharCode(255 & bs >> (-2 * bc & 6)) : 0
  ) {
    // try to find character in table (0-63, not found => -1)
    buffer = chars.indexOf(buffer);
  }
  return output;
}

My goal is to port this to Python, but I am trying to understand what the for loop is doing in JavaScript.

It checks if the value is located in the chars table and then initializes some variables using a ternary like: bs = bc % 4 ? bs*64+buffer: buffer, bc++ %4

I am not quite sure I understand what the buffer, bc++ % 4 part of the ternary is doing. The comma confuses me a bit. Plus the String.fromCharCode(255 & (bs >> (-2 * bc & 6))) is a bit esoteric to me.

I've been trying something like this in Python, which produces some results, albeit different than what the javascript implementation is doing

import re

# Test subject
b64_str: str = "fwHzODWqgMH+NjBq02yeyQ=="
    
# Lookup table for characters
chars: str = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

# Replace right padding with empty string
replaced = re.sub("=+$", '', b64_str)
if len(replaced) % 4 == 1:
    raise ValueError("atob failed. The string to be decoded is not valid base64")

# Bit storage and counters
bc = 0
out: str = ''
for i in replaced:

    # Get ascii value of character
    buffer = ord(i)

    # If counter is evenly divisible by 4, return buffer as is, else add the ascii value
    bs = bc * 64 + buffer if bc % 4 else buffer
    bc += 1 % 4 # Not sure I understand this part
    
    # Check if character is in the chars table
    if i in chars:

        # Check if the bit storage and bit counter are non-zero
        if bs and bc:
            # If so, convert the first 8 bits to an ascii character
            out += chr(255 & bs >> (-2 * bc & 6))
        else:
            out = 0
            
    # Set buffer to the index of where the first instance of the character is in the b64 string
    print(f"before: {chr(buffer)}")
    buffer = chars.index(chr(buffer))
    print(f"after: {buffer}")
    
print(out)

JS gives ó85ªÁþ60jÓlÉ

Python gives 2:u1(²ë:ð1G>%Y

Upvotes: -2

Views: 56

Answers (2)

mplungjan

Reputation: 177685

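  • The loop consumes the input one character at a time and treats it as groups of four. Each Base64 character carries 6 bits, so a group of four carries 24 bits, i.e. three bytes.
  • bc counts the valid characters seen so far, so bc % 4 is the position within the current group.
  • bs accumulates the bits of the current group: it is reset to the first character's value, then for each later character it is multiplied by 64 (a left shift by 6 bits) and the new value is added. output builds the decoded string by converting 8-bit chunks of bs to characters.
  • The comma operator evaluates both operands and yields the last one, so after updating bs the condition tests bc++ % 4 (the old value of bc): a byte is emitted for every character except the first of each group. The shift -2 * bc & 6 evaluates to 4, 2 and 0 for the second, third and fourth character, and 255 & keeps only the low 8 bits.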
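To make the bit arithmetic concrete, here is a small sketch that replays one group of four characters by hand. The helper name shift_for and the hard-coded table indices for "Zm9v" are illustrative choices, not part of the original JS; the shift expression itself is copied from it.

```python
def shift_for(bc):
    """Right shift the JS loop applies after bc has been incremented to `bc`."""
    # Same precedence in Python and JS: (-2 * bc) & 6
    return -2 * bc & 6

# The 2nd, 3rd and 4th character of each group emit one byte each.
shifts = [shift_for(bc) for bc in (2, 3, 4)]

# "Zm9v" maps to these indices in the chars table (Z=25, m=38, 9=61, v=47).
values = [25, 38, 61, 47]

bs = 0
decoded = []
for bc, value in enumerate(values, start=1):
    # First character of the group: start fresh. Otherwise append 6 more bits.
    bs = value if bc == 1 else bs * 64 + value
    if bc > 1:
        # Parsed as 255 & (bs >> shift): keep the 8 bits just completed.
        decoded.append(chr(255 & bs >> shift_for(bc)))

print(shifts, "".join(decoded))
```

The shifts shrink by 2 each time because every new character adds 6 bits while every emitted byte uses up 8.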

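Here is a tested Python version of the whole function (https://www.online-python.com/PiseKNFuaO). Note that it is a simplified rewrite rather than a literal port: bc counts pending bits instead of characters, and invalid characters raise an error, whereas the JS silently skips them.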

import base64

class InvalidCharacterError(Exception):
    pass

def atob(input_str):
    chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/='
    input_str = str(input_str).rstrip('=')
    
    if len(input_str) % 4 == 1:
        raise InvalidCharacterError("'atob' failed: The string to be decoded is not correctly encoded.")
    
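    output = []
    bc = 0      # number of bits in bs that have not been emitted yet
    bs = 0      # bit storage: every 6-bit value is appended on the right
    buffer = 0  # index of the current character in the chars table

    for char in input_str:
        buffer = chars.find(char)

        if buffer == -1:
            raise InvalidCharacterError("'atob' failed: The string to be decoded contains an invalid character.")

        # Append the 6 bits of this character to the storage
        bs = (bs << 6) + buffer
        bc += 6

        # Once at least 8 bits are pending, emit the oldest 8 as one character
        if bc >= 8:
            bc -= 8
            output.append(chr((bs >> bc) & 255))

    return ''.join(output)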

# Compare with Python's built-in Base64 decoding
def test_atob():
    test_strings = [
        "SGVsbG8gd29ybGQ=",  # "Hello world"
        "U29mdHdhcmUgRW5naW5lZXJpbmc=", # "Software Engineering"
        "VGVzdGluZyAxMjM=", # "Testing 123"
        "SGVsbG8gd29ybGQ==",  # "Hello world" with extra padding
        "SGVsbG8gd29ybGQ= ",  # "Hello world" with trailing space (invalid)
        "SGVsbG8gd29ybGQ\r\n",  # "Hello world" with newline characters (invalid)
        "Invalid!!==",  # Invalid characters
        "VGhpcyBpcyBhbiBlbmNvZGVkIHN0cmluZyE", # "This is an encoded string!" without padding
        "U29tZVNwZWNpYWwgQ2hhcnM6ICsgLyA=", # "SomeSpecial Chars: + / " with padding
    ]
    
    for encoded in test_strings:
        try:
            expected = base64.b64decode(encoded).decode('utf-8')
            result = atob(encoded)
            print(result == expected, "Custom:", result, "Expected:", expected)
        except Exception as e:
            print(f"Error for string: {encoded} - {e}")

test_atob()
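For comparison, a more literal port of the JS loop could look like the sketch below. The name atob_port is hypothetical, and this is one reading of the JS, not a definitive translation: the comma expression is split into separate statements, and characters missing from the table are skipped (in JS, ~(-1) is 0, so the update is short-circuited). The expected value is decoded as latin-1 because atob returns one character per byte, and the test string from the question is not valid UTF-8.

```python
import base64

CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="

def atob_port(input_str):
    # JS: String(input).replace(/=+$/, '')
    s = str(input_str).rstrip("=")
    if len(s) % 4 == 1:
        raise ValueError("'atob' failed: The string to be decoded is not correctly encoded.")

    bc = 0  # count of valid characters consumed so far
    bs = 0  # bit storage for the current group of four characters
    output = []
    for ch in s:
        buffer = CHARS.find(ch)  # 0-64, or -1 when not in the table
        if buffer == -1:
            continue  # JS: ~buffer is 0, so nothing else happens
        # JS: bs = bc % 4 ? bs * 64 + buffer : buffer
        bs = bs * 64 + buffer if bc % 4 else buffer
        # JS: bc++ % 4 tests the OLD value, then increments
        emit = bc % 4
        bc += 1
        if emit:
            output.append(chr(255 & bs >> (-2 * bc & 6)))
    return "".join(output)

sample = "fwHzODWqgMH+NjBq02yeyQ=="
expected = base64.b64decode(sample).decode("latin-1")
print(atob_port(sample) == expected)
```

The main difference from the attempt in the question is that the table lookup happens before the arithmetic, so bs is built from 6-bit indices rather than from ASCII codes.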

Upvotes: 1

Daweo

Reputation: 36340

I've been trying something like this in Python, which produces some results, albeit different than what the javascript implementation is doing

The first step is to determine whether either implementation works correctly. RFC 4648 contains test vectors for that purpose:

BASE64("") = ""
BASE64("f") = "Zg=="
BASE64("fo") = "Zm8="
BASE64("foo") = "Zm9v"
BASE64("foob") = "Zm9vYg=="
BASE64("fooba") = "Zm9vYmE="
BASE64("foobar") = "Zm9vYmFy"

If one implementation works correctly, you should determine what is causing the difference in the other; otherwise, you might attempt to implement a base64 decoder based on the description in RFC 4648.
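A minimal sketch of such a check, assuming the decoder under test takes the encoded string and returns a str. The names RFC4648_VECTORS, check_decoder and reference_decode are made up for this example; the standard library decoder stands in for whichever implementation you want to verify.

```python
import base64

# (plain text, Base64 encoding) pairs from the RFC 4648 test vectors
RFC4648_VECTORS = [
    ("", ""),
    ("f", "Zg=="),
    ("fo", "Zm8="),
    ("foo", "Zm9v"),
    ("foob", "Zm9vYg=="),
    ("fooba", "Zm9vYmE="),
    ("foobar", "Zm9vYmFy"),
]

def check_decoder(decode):
    """Return (encoded, expected, got) for every vector the decoder gets wrong."""
    failures = []
    for plain, encoded in RFC4648_VECTORS:
        got = decode(encoded)
        if got != plain:
            failures.append((encoded, plain, got))
    return failures

def reference_decode(encoded):
    # Stand-in decoder: swap in your own function here.
    return base64.b64decode(encoded).decode("ascii")

failures = check_decoder(reference_decode)
print(failures)
```

An empty list means the decoder agrees with all seven vectors; any entry shows the input, the expected text and what the decoder returned.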

Upvotes: 0
