Koray Tugay

Reputation: 23780

How does Java fit a 3 byte Unicode character into a char type?

So a 'char' in Java is 2 bytes. (Can be verified from here.)

I have this sample code:

public class FooBar {
    public static void main(String[] args) {
        String foo = "€";
        System.out.println(foo.getBytes().length);
        final char[] chars = foo.toCharArray();
        System.out.println(chars[0]);
    }
}

And the output is as follows:

3
€

My question is, how did Java fit a 3 byte character into a char data type? BTW, I am running the application with the parameter: -Dfile.encoding=UTF-8

Also if I edit the code a little further and add the following statements:

File baz = new File("baz.txt");
final DataOutputStream dataOutputStream = new DataOutputStream(new FileOutputStream(baz));
dataOutputStream.writeChar(chars[0]);
dataOutputStream.flush();
dataOutputStream.close();

the final file "baz.txt" will only be 2 bytes, and it will not show the correct character even if I treat it as a UTF-8 file.

Edit 2: If I open the file "baz.txt" with encoding UTF-16 BE, I will see the € character just fine in my text editor, which makes sense I guess.

Upvotes: 14

Views: 3603

Answers (2)

Shiladittya Chakraborty

Reputation: 4408

String.getBytes() returns the bytes using the platform's default character encoding, which does not necessarily match the internal representation.

Java uses 2 bytes in memory for each char. When chars are "serialized" using UTF-8, each one may produce one, two, or three bytes in the resulting byte array (a surrogate pair produces four); that is how the UTF-8 encoding works.

Your code example encodes the string as UTF-8 (because of -Dfile.encoding=UTF-8), whereas Java strings expose their contents as UTF-16 code units. A Unicode code point that does not fit in a single 16-bit char is encoded as a pair of chars known as a surrogate pair.
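
To make the difference between chars, code points, and UTF-8 bytes concrete, here is a minimal sketch. The class and helper names (CodeUnitDemo, codeUnits, codePoints, utf8Bytes) are made up for illustration, and the strings use Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitDemo {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes when the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String euro = "\u20AC";         // U+20AC EURO SIGN, fits in one char
        String smiley = "\uD83D\uDE00"; // U+1F600, needs a surrogate pair

        // Euro sign: 1 char, 1 code point, 3 UTF-8 bytes
        System.out.println(codeUnits(euro) + " " + codePoints(euro) + " " + utf8Bytes(euro));
        // U+1F600: 2 chars, 1 code point, 4 UTF-8 bytes
        System.out.println(codeUnits(smiley) + " " + codePoints(smiley) + " " + utf8Bytes(smiley));
        // The first char of the pair is a high surrogate, not a character on its own.
        System.out.println(Character.isHighSurrogate(smiley.charAt(0)));
    }
}
```

The Euro sign needs only one char because U+20AC is below U+FFFF; only code points above that range need two chars.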

If you do not pass an argument to String.getBytes(), it returns a byte array with the String contents encoded in the JVM's default charset, which is normally derived from the operating system's settings or the file.encoding property. To guarantee a UTF-8 encoded array, call getBytes("UTF-8") or getBytes(StandardCharsets.UTF_8) instead.
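
A small sketch of the point about explicit charsets, assuming Java 7 or later for StandardCharsets. ExplicitCharsetDemo and encodedLength are hypothetical names used only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Length of the byte array produced by encoding s with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // U+20AC EURO SIGN

        // The no-argument getBytes() depends on this value, so its result can vary per machine.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // With an explicit charset the result is the same everywhere.
        System.out.println(encodedLength(euro, StandardCharsets.UTF_8));    // 3
        System.out.println(encodedLength(euro, StandardCharsets.UTF_16BE)); // 2
    }
}
```

Passing a Charset constant rather than the name "UTF-8" also avoids having to handle UnsupportedEncodingException.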

Calling String.charAt() simply returns one UTF-16 code unit (a char) from the String; no conversion to another encoding takes place.

See also: java utf8 encoding - char, string types

Upvotes: 10

Thilo

Reputation: 262474

Java uses UTF-16 for the in-memory representation, so each char is one 16-bit code unit.

The Euro symbol (U+20AC) fits into a single 16-bit code unit, even though it needs three bytes in UTF-8.
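
To see the actual bytes, here is a rough sketch. EuroBytesDemo, hex, and writeCharBytes are illustrative names, and an in-memory stream stands in for the question's baz.txt. It compares the UTF-8 encoding with what DataOutputStream.writeChar emits, which is the char's two bytes, high byte first.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class EuroBytesDemo {
    // Formats bytes as space-separated uppercase hex, e.g. "E2 82 AC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Returns the bytes that DataOutputStream.writeChar produces for one char.
    static byte[] writeCharBytes(char c) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeChar(c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        char euro = '\u20AC';

        System.out.println((int) euro);                                               // 8364 (0x20AC)
        System.out.println(hex(String.valueOf(euro).getBytes(StandardCharsets.UTF_8))); // E2 82 AC
        System.out.println(hex(writeCharBytes(euro)));                                // 20 AC
    }
}
```

The two bytes 20 AC are exactly the UTF-16 big-endian encoding of the Euro sign, which is why the file from the question displays correctly when opened as UTF-16 BE but not as UTF-8.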

Upvotes: 8
