Reputation: 7859
If I convert a character to a byte and then back to a char, that character mysteriously disappears and becomes something else. How is this possible?
This is the code:
char a = 'È'; // line 1
byte b = (byte)a; // line 2
char c = (char)b; // line 3
System.out.println((char)c + " " + (int)c);
Until line 2 everything is fine:
In line 1 I could print "a" in the console and it would show "È".
In line 2 I could print "b" in the console and it would show -56, which is 200 when read as an unsigned value, because byte is signed. And 200 is "È". So it's still fine.
But what's wrong in line 3? "c" becomes something else and the program prints ? 65480. That's something completely different.
What should I write in line 3 in order to get the correct result?
Upvotes: 64
Views: 222589
Reputation: 93968
TL;DR: The byte value 0xC8 is a negative value which gets widened to a negative integer, and then narrowed back to a positive character with value 0xFFC8. Use char c = (char) (b & 0xFF) so that the intermediate integer value remains a positive 0x000000C8.
A character in Java is a Unicode code unit which is treated as an unsigned number. The Java char is basically UTF-16BE, which maps the character È to the value 0x00C8. Once this is narrowed down to a byte containing 0xC8 it represents the value -56. When casting to char the value is first widened to the integer representation of -56: 0xFFFFFFC8. This in turn is narrowed down to 0xFFC8 when casting to a char, which translates to the positive number 65480. This then gets widened to an integer again during your println statement, to 0x0000FFC8, representing the same value.
The Java Language Specification documents:
5.1.4. Widening and Narrowing Primitive Conversion
First, the byte is converted to an int via widening primitive conversion (§5.1.2), and then the resulting int is converted to a char by narrowing primitive conversion (§5.1.3).
Note that in Java, arithmetic on byte, short and char operands is always performed using 32-bit int values, unless one of the operands is a long. So widening and narrowing conversions are relatively common (although they may be optimized away in the VM).
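The conversion chain described above can be checked directly. Here is a minimal, self-contained sketch (the class name CharByteDemo is just illustrative):

```java
public class CharByteDemo {
    public static void main(String[] args) {
        char a = 'È';                     // 0x00C8 == 200
        byte b = (byte) a;                // narrowed to 0xC8 == -56
        char broken = (char) b;           // sign-extended then narrowed: 0xFFC8 == 65480
        char fixed  = (char) (b & 0xFF);  // mask keeps only the low 8 bits: 0x00C8 == 200

        System.out.printf("a      = 0x%04X (%d)%n", (int) a, (int) a);
        System.out.printf("b      = 0x%02X (%d)%n", b & 0xFF, b);
        System.out.printf("broken = 0x%04X (%d)%n", (int) broken, (int) broken);
        System.out.printf("fixed  = 0x%04X (%d)%n", (int) fixed, (int) fixed);
    }
}
```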
Let's show what happens in bits:
char a = 'È'; // 0b00000000_11001000 => a
byte b = (byte)a; // 0b00000000_11001000
// -(narrowing)->
// 0b1_1001000 => b
char c = (char)b; // 0b1_1001000
// -(widening)->
// 0b1_1111111_11111111_11111111_11001000
// -(narrowing)->
// 0b11111111_11001000 => c
println(.. (int)c); // 0b11111111_11001000
// -(widening)->
// 0b00000000_00000000_11111111_11001000
Here the leftmost sign bit of the signed representations for byte and int has been separated by a _. The widening performs sign extension because the byte 0b1_1001000 and the 32-bit integer 0b1_1111111_11111111_11111111_11001000 both represent the value -56 in two's complement representation.
To get the expected result use char c = (char) (b & 0xFF), which first converts the byte value of b to the positive integer 200 by using a mask, zeroing the top 24 bits after widening: 0xFFFFFFC8 becomes 0x000000C8, the positive number 200 in decimal.
char c = (char) (b & 0xFF); // 0b1_1001000
// -(widening)->
// 0b1_1111111_11111111_11111111_11001000
// -(AND with 0b0_0000000_00000000_00000000_11111111)->
// 0b0_0000000_00000000_00000000_11001000
// -(narrowing)->
// 0b00000000_11001000 => c
Note that 0xFF doesn't represent a byte, it represents an int; so it is basically the same as 0x000000FF.
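As a side note, since Java 8 the masking step is also available as a named library method; a minimal sketch using Byte.toUnsignedInt, which is equivalent to b & 0xFF:

```java
public class UnsignedDemo {
    public static void main(String[] args) {
        byte b = (byte) 'È';                   // -56, bit pattern 0xC8
        // Byte.toUnsignedInt (Java 8+) yields 200, same as b & 0xFF
        char c = (char) Byte.toUnsignedInt(b); // 0x00C8
        System.out.println(c + " " + (int) c); // prints the È character and 200
    }
}
```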
Above is a direct explanation of what happens during conversion between the byte, int and char primitive types.
If you want to encode/decode characters from bytes, look at classes such as Charset, CharsetEncoder and CharsetDecoder within the java.nio.charset package. Convenience methods such as new String(byte[] bytes, Charset charset) or String#getBytes(Charset charset) have been defined as well. The required character sets for the Java runtime can be accessed directly as constants through the java.nio.charset.StandardCharsets utility class.
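For example, here is a sketch of a round trip through an explicit charset. ISO-8859-1 is used because it maps È to the single byte 0xC8; the class name is illustrative:

```java
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        String original = "È";
        // Encode to bytes with an explicit charset; ISO-8859-1 maps È to 0xC8
        byte[] bytes = original.getBytes(StandardCharsets.ISO_8859_1);
        // Decode back using the same charset
        String decoded = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(decoded.equals(original)); // true
        System.out.println(bytes.length + " " + (bytes[0] & 0xFF)); // 1 200
    }
}
```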
Upvotes: 89
Reputation: 1
This worked for me:
// Add import statement
import java.nio.charset.Charset;
// Change
sun.io.ByteToCharConverter.getDefault().getCharacterEncoding() -> Charset.defaultCharset()
Upvotes: 0
Reputation: 1342
new String(byteArray, Charset.defaultCharset())
This converts a byte array to a String using the JVM's default charset. Note that this constructor does not throw on malformed input; bytes that cannot be decoded are replaced with the charset's default replacement character (use a CharsetDecoder directly if you need strict error handling).
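A minimal sketch of that approach (class name illustrative). Note that the round trip is only guaranteed lossless when the default charset can represent every character in the string:

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        Charset cs = Charset.defaultCharset();
        // Encode and decode with the SAME charset so the round trip is consistent
        byte[] byteArray = "È".getBytes(cs);
        String decoded = new String(byteArray, cs);
        // Whether decoded.equals("È") depends on the default charset:
        // true for e.g. UTF-8 or ISO-8859-1, false for US-ASCII
        System.out.println(cs + ": " + decoded);
    }
}
```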
Upvotes: -3