Reputation: 3426
I have a transparent, 1x1 GIF file with the following data:
$ xxd pixel.gif
00000000: 4749 4638 3961 0100 0100 f000 0000 0000 GIF89a..........
00000010: 0000 0021 f904 0100 0000 002c 0000 0000 ...!.......,....
00000020: 0100 0100 0002 0244 0100 3b .......D..;
The Base64 encoded data for this file is as follows:
$ openssl base64 -in pixel.gif
R0lGODlhAQABAPAAAAAAAAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==
If I decode this string, I get the following correct output:
$ echo 'R0lGODlhAQABAPAAAAAAAAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==' | openssl base64 -d | xxd
00000000: 4749 4638 3961 0100 0100 f000 0000 0000 GIF89a..........
00000010: 0000 0021 f904 0100 0000 002c 0000 0000 ...!.......,....
00000020: 0100 0100 0002 0244 0100 3b .......D..;
When trying to decode this string in Java, I get unexpected results back. Consider this example Java program:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Decode {
    public static void main(String[] args) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
            String line = reader.readLine();
            //System.out.println(line.getBytes());
            byte[] data = Base64.getDecoder().decode(line.getBytes());
            System.out.print(new String(data, 0, data.length, StandardCharsets.UTF_8));
        } catch (IOException e) {
            System.out.println("IOException reading System.in");
        }
    }
}
When I pipe the encoded string to this program, I get the following results:
$ echo 'R0lGODlhAQABAPAAAAAAAAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==' | java Decode | xxd
00000000: 4749 4638 3961 0100 0100 efbf bd00 0000 GIF89a..........
00000010: 0000 0000 0021 efbf bd04 0100 0000 002c .....!.........,
00000020: 0000 0000 0100 0100 0002 0244 0100 3b ...........D..;
I can see that at the 11th byte the expected 0xf0 changes to 0xef (followed by 0xbf 0xbd). The entire output is now 47 bytes long instead of 43 bytes long. Why is this happening with Java?
Upvotes: 0
Views: 430
Reputation: 265211
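You cannot convert arbitrary binary data to a UTF-8 string. UTF-8 is a Unicode encoding that follows strict rules: the first byte of a multibyte sequence starts with the high bits 11 and tells the decoder how many bytes the sequence contains, and every following byte must start with the high bits 10. Your data breaks these rules: 0xf0 announces a four-byte sequence but is followed by 0x00, and 0xf9 is never valid in UTF-8. The decoder replaces each invalid byte with the replacement character U+FFFD, which is written back out as the three bytes ef bf bd. Two bytes each grew into three, which is why your output is 4 bytes longer.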
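To see the replacement happen in isolation, here is a minimal sketch that pushes a few bytes from your GIF through the same byte-to-String-to-byte round trip as your program. The class name `Utf8RoundTrip` and the helpers `roundTrip` and `hex` are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the question's code.
class Utf8RoundTrip {
    // Decode the bytes as UTF-8, then encode the resulting String back to UTF-8.
    static byte[] roundTrip(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes as lowercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bytes 11-14 of the GIF: the invalid 0xf0 followed by zeros.
        byte[] raw = {(byte) 0xf0, 0x00, 0x00, 0x00};
        // Prints "f0000000 -> efbfbd000000", matching the xxd dump in the question.
        System.out.println(hex(raw) + " -> " + hex(roundTrip(raw)));

        // Plain ASCII bytes such as "GIF" survive the round trip unchanged.
        byte[] ascii = {0x47, 0x49, 0x46};
        System.out.println(hex(ascii) + " -> " + hex(roundTrip(ascii)));
    }
}
```

Only the bytes that are not valid UTF-8 get rewritten, which is why most of your output still looks right.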
What you really want is to write the byte array directly, not convert it to a String first:
byte[] data = Base64.getDecoder().decode(line.getBytes());
System.out.write(data);
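Put together, a complete version of the program might look like the sketch below. The class name `DecodeRaw` and the helper `decodeLine` are illustrative names, not from the question; the null check and the explicit flush are additions I am assuming you would want.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical rewrite of the question's Decode class.
class DecodeRaw {
    // Decode one line of Base64 text into raw bytes; no String conversion of the result.
    static byte[] decodeLine(String line) {
        return Base64.getDecoder().decode(line.trim());
    }

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line = reader.readLine();
        if (line == null) {
            System.err.println("No input on System.in");
            return;
        }
        byte[] data = decodeLine(line);
        // Write the bytes as-is; print() would force them through a charset.
        System.out.write(data, 0, data.length);
        System.out.flush();
    }
}
```

Piping the same echo command into `java DecodeRaw | xxd` should then show the same 43 bytes as the openssl output.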
Upvotes: 4