Monty Evans

Reputation: 151

Python 3.5 base64 decoding seems to be incorrect?

In Python 3.5 the base64 module has a function, standard_b64decode(), for decoding strings from base64, which returns a bytes object.

When I run base64.standard_b64decode("wc==") the output is b'\xc1'. But when you base64 encode b'\xc1', you get "wQ==", not "wc==". At first this looks like an error in the decoding function. Actually, I think "wc==" is an invalid base64 encoded string, by this reasoning:

  1. wc== ends with ==, which means that it was produced from a single input byte.

  2. The corresponding values of 'w' and 'c' in the regular base64 alphabet are, respectively, 48 and 28, meaning their 6-bit representations are, respectively, 110000 and 011100.

  3. Concatenating these, the first 8 bits are 11000001, which is \xc1, but the remaining 4 bits (1100) are non-zero. The padding process performed during base64 encoding only appends bits with value 0, so these extra 1 bits can't have been produced by a valid encoder, and the string is therefore not a valid base64 encoded string.

I think this is true for any 4 character chunk of base64 encoding ending in == when any of the last 4 bits of the second character are 1.
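The bit arithmetic in the steps above can be checked with a short sketch. The ALPHABET constant and the leftover_bits() helper are names made up for this illustration, not part of the base64 module; the only library call is base64.standard_b64decode().

```python
import base64

# The regular base64 alphabet from RFC 4648, indexed by 6-bit value.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def leftover_bits(chunk):
    """Return the 4 unused low bits of the second character of a
    4-character chunk ending in '==' (hypothetical helper)."""
    if len(chunk) != 4 or not chunk.endswith("=="):
        raise ValueError("expected a 4-character chunk ending in '=='")
    return ALPHABET.index(chunk[1]) & 0b1111


print(ALPHABET.index("w"), ALPHABET.index("c"))  # 48 28
print(format(leftover_bits("wc=="), "04b"))      # 1100 (non-zero pad bits)
print(format(leftover_bits("wQ=="), "04b"))      # 0000 (canonical)
print(base64.standard_b64decode("wc=="))         # b'\xc1'
```

A chunk ending in == is canonical exactly when leftover_bits() returns 0.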

I'm pretty convinced that this is right, but I'm rather less experienced than the Python developers.

Can anyone confirm the above, or explain why it's wrong, if indeed it is?

Upvotes: 2

Views: 1003

Answers (1)

Gareth Rees

Reputation: 65884

The Base64 standard is defined by RFC 4648. Your question is answered by §3.5:

Canonical Encoding

The padding step in base 64 and base 32 encoding can, if improperly implemented, lead to non-significant alterations of the encoded data. For example, if the input is only one octet for a base 64 encoding, then all six bits of the first symbol are used, but only the first two bits of the next symbol are used. These pad bits MUST be set to zero by conforming encoders, which is described in the descriptions on padding below. If this property do not hold, there is no canonical representation of base-encoded data, and multiple base-encoded strings can be decoded to the same binary data. If this property (and others discussed in this document) holds, a canonical encoding is guaranteed.

In some environments, the alteration is critical and therefore decoders MAY chose to reject an encoding if the pad bits have not been set to zero.

The meaning of MAY is defined by RFC 2119:

MAY This word, or the adjective "OPTIONAL", mean that an item is truly optional. One vendor may choose to include the item because a particular marketplace requires it or because the vendor feels that it enhances the product while another vendor may omit the same item.

So Python is not obliged by the standard to reject non-canonical encodings.
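If your environment is one where the alteration matters, one possible way to reject non-canonical input yourself is to decode, re-encode, and compare. This is only a sketch: strict_b64decode() is a hypothetical helper name, not something the base64 module provides, and it assumes the input uses the standard alphabet with padding and no embedded whitespace or newlines.

```python
import base64


def strict_b64decode(s):
    """Decode s, rejecting any encoding a conforming encoder could not
    have produced (hypothetical helper, not part of the base64 module)."""
    # validate=True makes b64decode reject characters outside the alphabet.
    raw = base64.b64decode(s, validate=True)
    data = s.encode("ascii") if isinstance(s, str) else bytes(s)
    # A canonical encoding survives a decode/encode round trip unchanged.
    if base64.b64encode(raw) != data:
        raise ValueError("non-canonical base64 encoding")
    return raw


print(strict_b64decode("wQ=="))  # b'\xc1'

try:
    strict_b64decode("wc==")
except ValueError as exc:
    print("rejected:", exc)
```

The round trip costs an extra encode, but it avoids reimplementing the pad-bit check by hand.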

Upvotes: 3
