art1go

Reputation: 89

Handling a char as a byte in Java, different results

Why are the following two results different?

bsh % System.out.println((byte)'\u0080');
-128

bsh % System.out.println("\u0080".getBytes()[0]);
63

Thanks for your answers.

Upvotes: 3

Views: 1650

Answers (5)

Paŭlo Ebermann

Reputation: 74800

Actually, if you want to get the same result with the getBytes() call, specify UTF-16LE as the charset encoding:

bsh % System.out.println("\u0080".getBytes("UTF-16LE")[0]);
-128

Java strings are sequences of UTF-16 code units, and the cast char -> byte keeps only the low-order byte of such a unit. Little endian puts that byte first, so index 0 gives the same value as the cast. Big endian works too, if we change the array index:

bsh % System.out.println("\u0080".getBytes("UTF-16BE")[1]);
-128
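
To see both bytes at once, here is a minimal sketch that prints the complete UTF-16 encodings of the string. The class name Utf16BytesDemo and its helper methods are made up for illustration. It uses the StandardCharsets constants rather than charset name strings, so no checked exception is involved.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
class Utf16BytesDemo {
    static final String S = "\u0080";

    // Low-order byte first, so index 0 matches the char -> byte cast.
    static byte[] littleEndian() {
        return S.getBytes(StandardCharsets.UTF_16LE);
    }

    // High-order byte first, so the low-order byte sits at index 1.
    static byte[] bigEndian() {
        return S.getBytes(StandardCharsets.UTF_16BE);
    }

    // Plain UTF-16 writes a big-endian byte order mark before the data.
    static byte[] withByteOrderMark() {
        return S.getBytes(StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(littleEndian()));      // [-128, 0]
        System.out.println(Arrays.toString(bigEndian()));         // [0, -128]
        System.out.println(Arrays.toString(withByteOrderMark())); // [-2, -1, 0, -128]
    }
}
```

Note that with plain "UTF-16" the byte order mark shifts the indices, so neither [0] nor [1] gives -128 there.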

Upvotes: 0

Michael Borgwardt

Reputation: 346536

(byte)'\u0080' just takes the numerical value of the char (0x0080 = 128), which does not fit into a signed byte (range -128 to 127). It is therefore subject to a narrowing primitive conversion, which drops all bits except the low 8. Since the highest of those remaining bits is set, the result is a negative number: -128.
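
A small sketch of that narrowing conversion, in case it helps to see the numbers. The class NarrowingDemo and its method names are hypothetical, chosen only for this example.

```java
// Hypothetical demo class, not part of the original answer.
class NarrowingDemo {
    // Narrowing cast: keeps only the low 8 bits of the 16-bit char value.
    static byte narrow(char c) {
        return (byte) c;
    }

    // Masking with 0xFF recovers those 8 bits as a non-negative int.
    static int lowByteUnsigned(char c) {
        return narrow(c) & 0xFF;
    }

    public static void main(String[] args) {
        char c = '\u0080';
        System.out.println((int) c);                   // 128
        System.out.println(Integer.toBinaryString(c)); // 10000000
        System.out.println(narrow(c));                 // -128
        System.out.println(lowByteUnsigned(c));        // 128
        // 0x0141 loses its high byte, leaving 0x41.
        System.out.println(narrow('\u0141'));          // 65
    }
}
```

The last line shows that the cast is not an encoding step at all: it silently discards the high byte, whatever the character is.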

"\u0080".getBytes()[0] transforms the characters to bytes according to your platform default encoding (there is an overloaded getBytes() method that allows you to specify the encoding). It looks like your platform default encoding cannot represent codepoint U+0080, and replaces it by "?" (codepoint U+003F, decimal value 63).

Upvotes: 5

Bozho

Reputation: 597412

Here the byte array has 2 elements, because the encoded representation of this Unicode character does not fit in 1 byte.

On my machine the array contains [-62, -128]. That's because my default encoding is UTF-8. Never call getBytes() without specifying an encoding.
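
As a rough sketch of where those two bytes come from: UTF-8 stores code points from U+0080 to U+07FF as a lead byte 110xxxxx followed by a continuation byte 10xxxxxx. The helper encodeTwoByte below is written by hand for illustration and covers only that range; in real code you would simply pass a charset to getBytes().

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
class Utf8TwoByteDemo {
    // Hand-rolled UTF-8 encoding, valid only for U+0080..U+07FF.
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("outside the two-byte range");
        }
        byte lead = (byte) (0xC0 | (codePoint >> 6));           // 110xxxxx: top 5 bits
        byte continuation = (byte) (0x80 | (codePoint & 0x3F)); // 10xxxxxx: low 6 bits
        return new byte[] { lead, continuation };
    }

    public static void main(String[] args) {
        // Both lines print [-62, -128], i.e. the bytes 0xC2 0x80.
        System.out.println(Arrays.toString(encodeTwoByte(0x80)));
        System.out.println(Arrays.toString("\u0080".getBytes(StandardCharsets.UTF_8)));
    }
}
```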

Upvotes: 2

Peter Lawrey

Reputation: 533880

When a character encoding doesn't support a character, getBytes() turns that character into '?', which is 63 in ASCII.

Try:

System.out.println(Arrays.toString("\u0080".getBytes("UTF-8")));

This prints:

[-62, -128]

Upvotes: 1

axtavt

Reputation: 242786

The Unicode character U+0080 <control> can't be represented in your system default encoding, and is therefore replaced by ? (ASCII code 0x3F = 63) when the string is encoded into that encoding by getBytes().
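
One way to check this yourself is sketched below. The asker's actual default encoding is unknown, so US-ASCII stands in here as an assumed example of a charset that cannot represent U+0080. The class UnmappableDemo and its helpers are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class UnmappableDemo {
    // True if the charset can encode the character without substitution.
    static boolean representable(Charset cs, char c) {
        return cs.newEncoder().canEncode(c);
    }

    // First byte produced by getBytes() with an explicit charset.
    static byte firstByte(Charset cs, String s) {
        return s.getBytes(cs)[0];
    }

    public static void main(String[] args) {
        String s = "\u0080";
        Charset[] charsets = {
            StandardCharsets.US_ASCII,   // assumed stand-in for a charset lacking U+0080
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset cs : charsets) {
            System.out.println(cs + ": representable=" + representable(cs, s.charAt(0))
                    + ", first byte=" + firstByte(cs, s));
        }
        // The platform default is whatever getBytes() without arguments uses.
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

With US-ASCII the character is not representable and the first byte is 63; with ISO-8859-1 it is representable and the byte is -128; with UTF-8 it is representable and the first byte is -62.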

Upvotes: 3
