Reputation: 23780
So a 'char' in Java is 2 bytes. (Can be verified from here.)
I have this sample code:
public class FooBar {
    public static void main(String[] args) {
        String foo = "€";
        System.out.println(foo.getBytes().length);
        final char[] chars = foo.toCharArray();
        System.out.println(chars[0]);
    }
}
And the output is as follows:
3
€
My question is: how did Java fit a 3-byte character into a char data type? BTW, I am running the application with the parameter -Dfile.encoding=UTF-8
Also, if I edit the code a little further and add the following statements (with java.io.File, java.io.DataOutputStream, and java.io.FileOutputStream imported):
File baz = new File("baz.txt");
final DataOutputStream dataOutputStream = new DataOutputStream(new FileOutputStream(baz));
dataOutputStream.writeChar(chars[0]);
dataOutputStream.flush();
dataOutputStream.close();
the final file "baz.txt" will only be 2 bytes, and it will not show the correct character even if I treat it as a UTF-8 file.
Edit 2: If I open the file "baz.txt" with encoding UTF-16 BE, I will see the € character just fine in my text editor, which makes sense I guess.
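Here is a minimal sketch (class name is just for illustration) that hex-dumps baz.txt and confirms this, since writeChar() writes the single UTF-16 code unit high byte first:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DumpBytes {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get("baz.txt"));
        for (byte b : bytes) {
            System.out.printf("%02x ", b & 0xff); // prints: 20 ac  (U+20AC, big-endian)
        }
    }
}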
Upvotes: 14
Views: 3603
Reputation: 4408
String.getBytes()
returns the bytes using the platform's default character encoding, which does not necessarily match the internal representation.
Java uses 2 bytes in RAM for each char. When chars are "serialized" using UTF-8, each one may produce one, two, or three bytes in the resulting byte array; that is how the UTF-8 encoding works.
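For example, a minimal sketch of the 1-, 2-, and 3-byte cases (the explicit charset avoids platform dependence):

import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // Each literal is a single Java char, but UTF-8 needs a different byte count.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length); // 1 (U+0041)
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length); // 2 (U+00E9)
        System.out.println("€".getBytes(StandardCharsets.UTF_8).length); // 3 (U+20AC)
    }
}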
Your code example is using UTF-8 (via -Dfile.encoding=UTF-8). Java strings are encoded in memory using UTF-16 instead. Unicode code points that do not fit in a single 16-bit char are encoded as a 2-char sequence known as a surrogate pair.
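To see a surrogate pair in action, here is a sketch using U+1D11E (MUSICAL SYMBOL G CLEF), which lies outside the BMP:

public class SurrogateDemo {
    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E: one code point, two chars
        System.out.println(clef.length());                         // 2
        System.out.println(clef.codePointCount(0, clef.length())); // 1
        // charAt() exposes the raw UTF-16 code units of the pair
        System.out.printf("%04x %04x%n", (int) clef.charAt(0), (int) clef.charAt(1)); // d834 dd1e
    }
}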
If you do not pass a charset to String.getBytes(), it returns a byte array with the String contents encoded using the underlying OS's default charset. If you want to ensure a UTF-8 encoded array, use getBytes("UTF-8") (or, since Java 7, getBytes(StandardCharsets.UTF_8)) instead.
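A sketch contrasting the default charset with explicit ones (the output of the first two lines depends on your platform):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        String foo = "€";
        System.out.println(Charset.defaultCharset());                       // platform dependent
        System.out.println(foo.getBytes().length);                          // depends on the default
        System.out.println(foo.getBytes(StandardCharsets.UTF_8).length);    // always 3
        System.out.println(foo.getBytes(StandardCharsets.UTF_16BE).length); // always 2
    }
}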
Calling String.charAt() just returns the UTF-16 code unit at that index from the String's in-memory storage; no charset conversion takes place.
Check this link: java utf8 encoding - char, string types
Upvotes: 10
Reputation: 262474
Java uses UTF-16 (16-bit code units) for the in-memory representation of strings.
The Euro sign (U+20AC) fits into a single char, even though it needs three bytes in UTF-8.
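A short sketch to verify both claims:

import java.nio.charset.StandardCharsets;

public class Euro {
    public static void main(String[] args) {
        char euro = '\u20AC'; // the Euro sign fits in a single 16-bit char
        System.out.printf("%04x%n", (int) euro);                         // 20ac
        System.out.println("€".getBytes(StandardCharsets.UTF_8).length); // 3 bytes in UTF-8
    }
}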
Upvotes: 8