Reputation: 175
I don't understand why this code doesn't output the same thing for both encodings. I thought Java automatically figured out the encoding of a string?
public static void main(String[] args) {
    try {
        displayStringAsHex("A B C \u03A9".getBytes("UTF-8"));
        System.out.println("");
        displayStringAsHex("A B C \u03A9".getBytes("UTF-16"));
    } catch (UnsupportedEncodingException ex) {
        ex.printStackTrace();
    }
}

/**
 * I got part of this from: http://rgagnon.com/javadetails/java-0596.html
 */
public static void displayStringAsHex(byte[] raw) {
    String HEXES = "0123456789ABCDEF";
    System.out.println("raw = " + new String(raw));
    final StringBuilder hex = new StringBuilder(2 * raw.length);
    for (final byte b : raw) {
        hex.append(HEXES.charAt((b & 0xF0) >> 4))
           .append(HEXES.charAt((b & 0x0F))).append(" ");
    }
    System.out.println("hex.toString() = " + hex.toString());
}
outputs:
(UTF-8)
hex.toString() = 41 20 42 20 43 20 CE A9
(UTF-16)
hex.toString() = FE FF 00 41 00 20 00 42 00 20 00 43 00 20 03 A9
I cannot display the character output, but the UTF-8 version looks correct. The UTF-16 version has several squares and blocks.
Why don't they look the same?
Upvotes: 1
Views: 1356
Reputation: 589
Java does not automatically figure out the encoding of a string.
The String(byte[]) constructor "constructs a new String by decoding the specified array of bytes using the platform's default charset."
In your case the UTF-16 bytes are being decoded with that default charset (presumably UTF-8 on your machine) instead of UTF-16, so you end up with garbage.
Use new String(raw, Charset.forName("UTF-16")) to rebuild the String.
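To illustrate, here is a minimal sketch of decoding with the same charset that was used for encoding. The class name CharsetRoundTrip and the helpers toHex and roundTrip are made up for this example, and it assumes Java 7 or later so that StandardCharsets is available (Charset.forName("UTF-16") works the same way on older versions).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not from the question.
public class CharsetRoundTrip {

    // Formats each byte as two uppercase hex digits, separated by single spaces.
    static String toHex(byte[] raw) {
        StringBuilder hex = new StringBuilder(3 * raw.length);
        for (byte b : raw) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    // Encodes the text with the given charset, then decodes the bytes
    // with that same charset instead of the platform default.
    static String roundTrip(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String text = "A B C \u03A9";
        Charset[] charsets = {StandardCharsets.UTF_8, StandardCharsets.UTF_16};
        for (Charset cs : charsets) {
            byte[] raw = text.getBytes(cs);
            System.out.println(cs.name() + " hex = " + toHex(raw));
            System.out.println("round trip ok = " + text.equals(roundTrip(text, cs)));
        }
        // Decoding the UTF-16 bytes as UTF-8 does not recover the original text.
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
        String wrong = new String(utf16, StandardCharsets.UTF_8);
        System.out.println("mis-decoded ok = " + text.equals(wrong));
    }
}
```

The leading FE FF in your UTF-16 output is the byte order mark that Java writes when encoding with "UTF-16". It is consumed again when the bytes are decoded as UTF-16, so it does not show up in the rebuilt String.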
Upvotes: 2