leonsPAPA

Reputation: 797

Why is a Base64-encoded string larger than the original file?

My original PDF file is around 24MB, but when I encode it as a Base64 string, the string is around 31MB. I'm wondering why that is.

This is easy to understand for an image file, since it might lose some compression, but does it also happen with PDFs and other file formats?

Upvotes: 25

Views: 20935

Answers (1)

T.J. Crowder

Reputation: 1074028

> just wondering why

Because Base64 has fewer meaningful bits per byte than a binary data format (6 instead of 8), so every three bytes of input become four bytes of output, an increase of about a third. This is specifically so it can survive various textual transformations that binary data would not.
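To put rough numbers on that, here is a minimal Python sketch. The helper name `base64_length` is made up for illustration, and it assumes standard padded Base64 with no line breaks (MIME-style output with line breaks would be slightly larger still):

```python
import base64

def base64_length(n_bytes: int) -> int:
    """Length of standard padded Base64 output for n_bytes of input."""
    # Every group of 3 input bytes (24 bits) becomes 4 output characters
    # (4 x 6 bits); a final partial group is padded out to 4 characters.
    return 4 * ((n_bytes + 2) // 3)

data = bytes(1000)  # 1000 zero bytes as a stand-in for real file content
encoded = base64.b64encode(data)

print(len(data), len(encoded), base64_length(len(data)))
```

By the same formula, a 24MB input comes out at about 32MB, which is in the same ballpark as the roughly 31MB in the question.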

Wikipedia's Base64 page has a good diagram showing this.

As a text table (sadly the GitHub-flavored markdown used by SO doesn't support tables with varying numbers of columns):

+−−−−−−−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−+
|   Text content  |               M               |               a               |               n               |
+−−−−−−−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−+
|     ASCII       |           77 (0x4d)           |           97 (0x61)           |          110 (0x6e)           |
|  Bit pattern    | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
|     Index       |           19          |           22          |           5           |           46          |
| Base64−encoded  |           T           |           W           |           F           |           u           |
+−−−−−−−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−−+−−−−−−−−−−−−−−−−−−−−−−−+

Note how Base64 uses only the bottom six bits of each output byte, so the three bytes of "Man" end up as four bytes ("TWFu").
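The regrouping in the table can be reproduced by hand. This is only a sketch for illustration: `ALPHABET` and `encode_three_bytes` are names I've made up here, it handles exactly three bytes and ignores padding, and in real code you'd just call `base64.b64encode`:

```python
import base64

# The standard Base64 alphabet: index 0 is 'A', index 63 is '/'.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def encode_three_bytes(chunk: bytes) -> str:
    """Encode exactly 3 bytes as 4 Base64 characters (no padding handling)."""
    if len(chunk) != 3:
        raise ValueError("this sketch only handles 3-byte chunks")
    bits = int.from_bytes(chunk, "big")  # the 24 bits as one integer
    # Slice the 24 bits into four 6-bit indexes, most significant first.
    indexes = [(bits >> shift) & 0b111111 for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[i] for i in indexes)

print(encode_three_bytes(b"Man"))            # TWFu
print(base64.b64encode(b"Man").decode())     # TWFu
```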

> It is easy to understand for image file since it may lose some compression

Just to be clear, Base64 encoding is lossless. When you decode it, you get byte-for-byte what you started with.
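If you want to convince yourself, a quick round trip shows it. This sketch uses random bytes from `os.urandom` as a stand-in for the PDF contents (an assumption for illustration, any bytes would do):

```python
import base64
import os

original = os.urandom(4096)  # arbitrary binary data standing in for a file
encoded = base64.b64encode(original)
decoded = base64.b64decode(encoded)

print(decoded == original)           # True: nothing was lost
print(len(original), len(encoded))   # the encoded form is the larger one
```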

Upvotes: 48
