user1883212

Reputation: 7859

Byte and char conversion in Java

If I convert a character to byte and then back to char, that character mysteriously disappears and becomes something else. How is this possible?

This is the code:

char a = 'È';       // line 1       
byte b = (byte)a;   // line 2       
char c = (char)b;   // line 3
System.out.println((char)c + " " + (int)c);

Up to line 2 everything is fine.

But what's wrong in line 3? "c" becomes something else and the program prints ? 65480. That's something completely different.

What should I write in line 3 in order to get the correct result?

Upvotes: 64

Views: 222589

Answers (3)

Maarten Bodewes

Reputation: 93968

TL;DR: The byte value 0xC8 is a negative value which gets widened to a negative integer, and then narrowed back again to a positive character with value 0xFFC8. Use char c = (char) (b & 0xFF) so that the intermediate integer value remains a positive 0x000000C8.

What happens

Short explanation

A character in Java is a Unicode code unit, which is treated as an unsigned number. The Java char is basically UTF-16BE, which maps the character È to the value 0x00C8. Once this is narrowed down to a byte containing 0xC8 it represents the value -56. When casting to char, the value is first widened to the integer representation of -56: 0xFFFFFFC8. This in turn is narrowed down to 0xFFC8 when casting to a char, which translates to the positive number 65480. The (int) cast in your println statement then widens it again, to 0x0000FFC8, representing the same value.
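You can check these intermediate values yourself; a minimal sketch (the variable names simply mirror the question) would be:

char a = 'È';                  // 0x00C8, decimal 200
byte b = (byte) a;             // 0xC8 as a byte, decimal -56
char c = (char) b;             // 0xFFC8, decimal 65480

System.out.println(Integer.toHexString(a));         // c8
System.out.println(Integer.toHexString(b & 0xFF));  // c8       (b itself prints as -56)
System.out.println(Integer.toHexString(b));         // ffffffc8, the sign-extended int
System.out.println(Integer.toHexString(c));         // ffc8
System.out.println((int) c);                        // 65480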

The JLS

The Java Language Specification documents:

5.1.4. Widening and Narrowing Primitive Conversion

First, the byte is converted to an int via widening primitive conversion (§5.1.2), and then the resulting int is converted to a char by narrowing primitive conversion (§5.1.3).

Note that in Java, most integer calculations are performed using 32-bit int arithmetic: byte, short and char operands are first promoted to int, unless one of the operands is a long. So widening and narrowing conversions are relatively common (although they may be optimized away in the VM).
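For instance, adding two byte values already happens in int arithmetic (a small illustration of the promotion rule, unrelated to the question's code):

byte x = 100;
byte y = 100;
// byte z = x + y;              // does not compile: x + y is an int
int sum = x + y;                // both operands are promoted to int first
byte back = (byte) (x + y);     // explicit narrowing truncates 200 to -56
System.out.println(sum);        // 200
System.out.println(back);       // -56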

Worked example

Let's show what happens in bits:

char a = 'È';       // 0b00000000_11001000 => a

byte b = (byte)a;   // 0b00000000_11001000
                    // -(narrowing)->
                    // 0b1_1001000  => b

char c = (char)b;   // 0b1_1001000
                    // -(widening)->
                    // 0b1_1111111_11111111_11111111_11001000
                    // -(narrowing)->
                    // 0b11111111_11001000 => c

println(.. (int)c); // 0b11111111_11001000
                    // -(widening)->
                    // 0b00000000_00000000_11111111_11001000

Here the leftmost sign bit of the signed representations for byte and int has been separated by a _. The reason why the widening, or sign extension, happens is that the byte 0b1_1001000 and the 32-bit integer 0b1_1111111_11111111_11111111_11001000 both represent the value -56 in two's complement representation.
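Integer.toBinaryString shows the same sign extension at runtime (a quick check, assuming the same È character as above):

byte b = (byte) 'È';                                  // -56
System.out.println(Integer.toBinaryString(b));        // 11111111111111111111111111001000 (32-bit two's complement of -56)
System.out.println(Integer.toBinaryString((char) b)); // 1111111111001000 (0xFFC8, i.e. 65480)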

Handling the problem

Getting the expected result

To get the expected result, use char c = (char) (b & 0xFF). This first converts the byte value of b to the positive integer 200 by using a mask, zeroing the top 24 bits after the widening conversion: 0xFFFFFFC8 becomes 0x000000C8, which is the positive number 200 in decimal.

Worked example

char c = (char) (b & 0xFF); // 0b1_1001000
                            // -(widening)->
                            // 0b1_1111111_11111111_11111111_11001000
                            // -(AND with 0b0_0000000_00000000_00000000_11111111)->
                            // 0b0_0000000_00000000_00000000_11001000
                            // -(narrowing)->
                            // 0b00000000_11001000 => c

Note that 0xFF doesn't represent a byte; it is an int literal, so it is basically the same as 0x000000FF.
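Putting it together, the question's code with the mask applied (a sketch of the same fix, nothing else changed):

char a = 'È';
byte b = (byte) a;
char c = (char) (b & 0xFF);              // the mask keeps the intermediate int positive: 0x000000C8
System.out.println(c + " " + (int) c);   // prints: È 200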

Dealing with character-encoding

Above is a direct explanation of what happens during conversion between the byte, int and char primitive types.

If you want to encode/decode characters from bytes, look at classes such as Charset, CharsetEncoder and CharsetDecoder within the java.nio.charset package. Convenience methods such as new String(byte[] bytes, Charset charset) and String#getBytes(Charset charset) are available as well. The character sets required for every Java runtime can be accessed directly as constants through the java.nio.charset.StandardCharsets utility class.
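For example, a round trip through an explicit charset (a minimal sketch; ISO-8859-1 is chosen here only because it encodes È as the single byte 0xC8):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] latin1 = "È".getBytes(StandardCharsets.ISO_8859_1);   // [-56], i.e. 0xC8
byte[] utf8   = "È".getBytes(StandardCharsets.UTF_8);        // [-61, -120], i.e. 0xC3 0x88

String decoded = new String(latin1, StandardCharsets.ISO_8859_1);
System.out.println(decoded + " " + (int) decoded.charAt(0)); // È 200
System.out.println(Arrays.toString(utf8));                   // [-61, -120]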

Upvotes: 89

Vivek Kumar

Reputation: 1

This worked for me: add the import statement

import java.nio.charset.Charset;

and replace

sun.io.ByteToCharConverter.getDefault().getCharacterEncoding()

with

Charset.defaultCharset()
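For context, Charset.defaultCharset() simply returns the JVM's default charset, so the replacement gives you the same kind of information as the old sun.io call (a sketch; the output is platform dependent):

import java.nio.charset.Charset;

String encodingName = Charset.defaultCharset().name();
System.out.println(encodingName);   // e.g. UTF-8, depending on the platform/JVM settings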

Upvotes: 0

Joe

Reputation: 1342

new String(byteArray, Charset.defaultCharset())

This will decode a byte array into a String using the default charset in Java. Note that this constructor does not throw on malformed input; invalid byte sequences are replaced with the charset's replacement character, so the result depends on what is in the byteArray and on the platform's default charset.
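For example (a sketch; the byte values assume an ASCII-compatible default charset):

import java.nio.charset.Charset;

byte[] byteArray = {72, 101, 108, 108, 111};                   // "Hello" in ASCII
String text = new String(byteArray, Charset.defaultCharset());
System.out.println(text);                                      // Hello on most platforms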

Upvotes: -3
