mjc

Reputation: 3426

Java Base64 decode results differ unexpectedly

I have a transparent, 1x1 GIF file with the following data:

$ xxd pixel.gif
00000000: 4749 4638 3961 0100 0100 f000 0000 0000  GIF89a..........
00000010: 0000 0021 f904 0100 0000 002c 0000 0000  ...!.......,....
00000020: 0100 0100 0002 0244 0100 3b              .......D..;

The Base64 encoded data for this file is as follows:

$ openssl base64 -in pixel.gif
R0lGODlhAQABAPAAAAAAAAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==

If I decode this string, I get the following correct output:

$ echo 'R0lGODlhAQABAPAAAAAAAAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==' | openssl base64 -d | xxd
00000000: 4749 4638 3961 0100 0100 f000 0000 0000  GIF89a..........
00000010: 0000 0021 f904 0100 0000 002c 0000 0000  ...!.......,....
00000020: 0100 0100 0002 0244 0100 3b

When I try to decode this string in Java, I get unexpected results. Consider this example Java program:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;

import java.nio.charset.StandardCharsets;

import java.util.Base64;

public class Decode {
    public static void main(String[] args) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
            String line = reader.readLine();

            //System.out.println(line.getBytes());
            byte[] data = Base64.getDecoder().decode(line.getBytes());
            System.out.print(new String(data, 0, data.length, StandardCharsets.UTF_8));
        } catch (IOException e) {
            System.out.println("IOException reading System.in");
        }
    }
}

When I pipe the encoded string to this program, I get the following results:

$ echo 'R0lGODlhAQABAPAAAAAAAAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==' | java Decode | xxd
00000000: 4749 4638 3961 0100 0100 efbf bd00 0000  GIF89a..........
00000010: 0000 0000 0021 efbf bd04 0100 0000 002c  .....!.........,
00000020: 0000 0000 0100 0100 0002 0244 0100 3b    ...........D..

I can see that at the 11th byte the expected 0xf0 has changed to 0xef. The decoded output is now 47 bytes long instead of 43. Why is this happening in Java?

Upvotes: 0

Views: 430

Answers (1)

knittl

Reputation: 265211

You cannot convert arbitrary binary data to a UTF-8 string. UTF-8 is a Unicode encoding that follows strict rules: the first byte of a multi-byte sequence starts with the bits 11 and tells the decoder how many bytes the sequence contains, and every continuation byte starts with the bits 10. In your data, 0xf0 announces a four-byte sequence, but the 0x00 bytes that follow are not continuation bytes, so the decoder replaces the invalid sequence with U+FFFD, the Unicode replacement character. U+FFFD encodes as the three bytes ef bf bd, which is exactly what appears in your output; 0xf9 (which is never valid in UTF-8) gets the same treatment, and those two substitutions account for the growth from 43 to 47 bytes.
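
To see the corruption in isolation, here is a minimal sketch (the class name Utf8RoundTrip is mine, and java.util.HexFormat requires Java 17 or newer) that round-trips a few raw bytes through a UTF-8 String the way your Decode program does:

import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        // 0xf0 announces a four-byte UTF-8 sequence, but 0x00 is not a
        // continuation byte (those must start with the bits 10), so the
        // decoder substitutes U+FFFD, the replacement character.
        byte[] original = { 0x01, 0x00, (byte) 0xf0, 0x00 };
        byte[] roundTripped = new String(original, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);

        // U+FFFD re-encodes as the three bytes ef bf bd, so the data grows.
        System.out.println(HexFormat.of().formatHex(original));     // 0100f000
        System.out.println(HexFormat.of().formatHex(roundTripped)); // 0100efbfbd00
    }
}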

What you really want is to write the byte array directly, not convert it to a String first:

byte[] data = Base64.getDecoder().decode(line.getBytes());
System.out.write(data);
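
For completeness, here is a sketch of the full program with that change applied (the flush() call is my addition, to make sure the bytes reach the pipe before the JVM exits):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import java.nio.charset.StandardCharsets;

import java.util.Base64;

public class Decode {
    public static void main(String[] args) {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(System.in, StandardCharsets.UTF_8));
            String line = reader.readLine();

            // Write the decoded bytes straight to stdout; no String
            // (and therefore no charset conversion) is involved.
            byte[] data = Base64.getDecoder().decode(line.getBytes());
            System.out.write(data);
            System.out.flush();
        } catch (IOException e) {
            System.out.println("IOException reading System.in");
        }
    }
}

Piping the encoded string through this version should reproduce the original 43 bytes, because the decoded data never passes through a charset conversion.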

Upvotes: 4
