Reputation: 35
I have converted a string containing a foreign character, 晝, to a byte array. A byte can store values from -128 to 127, so the corresponding value has been stored as 3 bytes: -26, -103, -99.
Here's the conversion code:
String str = "晝";
byte[] b = str.getBytes();
for (byte bt : b)
    System.out.println(bt);
String str1 = new String(b);
System.out.println(str1);
Can you please clarify how these 3 bytes were calculated for the foreign character?
Upvotes: 2
Views: 165
Reputation: 234807
All conversions from characters to bytes use some character set to do the encoding.
You don't say, but I assume that you did the conversion using String.getBytes(). This is simply a shortcut for String.getBytes(Charset.defaultCharset()), and the default Charset depends on your particular Java environment. The three values you report are (in hex) 0xE6 0x99 0x9D, which is the UTF-8 encoding of U+665D (Unicode Han Character 'daytime, daylight'). Since that's the character that you report having started with, presumably the default character set for your environment is UTF-8 (which is not a surprise, but not something you can count on everywhere).
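Since the default charset varies between environments, one way to make the result predictable is to name the charset explicitly. The following is a minimal sketch; the class name ExplicitCharsetDemo and the helper methods toUtf8 and fromUtf8 are made up for illustration, while StandardCharsets and the getBytes(Charset) overload are part of the standard library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExplicitCharsetDemo {
    // Hypothetical helper: encode with UTF-8 regardless of the platform default.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical helper: decode with the same charset that was used to encode.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u665D" is 晝; the escape avoids depending on the source file encoding.
        String str = "\u665D";

        // Environment-dependent, which is why getBytes() without arguments can vary.
        System.out.println("default charset: " + Charset.defaultCharset());

        byte[] utf8 = toUtf8(str);
        System.out.println(Arrays.toString(utf8)); // [-26, -103, -99]

        // A different charset yields different bytes for the same character.
        byte[] utf16 = str.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(utf16)); // [102, 93]

        System.out.println(fromUtf8(utf8).equals(str)); // true
    }
}
```

Passing the same charset to both getBytes and the String constructor keeps the round trip independent of where the code runs.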
Upvotes: 1
Reputation: 2446
晝 is U+665D. It looks like when you converted it, you converted it to UTF-8. UTF-8 is a variable-length encoding of Unicode characters. Characters in [U+0800, U+FFFF] are converted to 3 bytes.
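To see where the 3 bytes come from, here is a rough sketch of the 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) applied by hand. The class name Utf8ThreeByteDemo and the method encodeThreeByte are invented for this illustration; in real code you would just let getBytes do the work.

```java
class Utf8ThreeByteDemo {
    // Hypothetical helper: pack a code point in [U+0800, U+FFFF] into 3 UTF-8 bytes.
    static byte[] encodeThreeByte(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not in the 3-byte range");
        }
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) {
            throw new IllegalArgumentException("surrogates are not encoded in UTF-8");
        }
        // U+665D in binary is 0110 011001 011101, split as 4 + 6 + 6 bits.
        byte b1 = (byte) (0xE0 | (codePoint >> 12));         // 1110 0110   -> 0xE6
        byte b2 = (byte) (0x80 | ((codePoint >> 6) & 0x3F)); // 10 011001   -> 0x99
        byte b3 = (byte) (0x80 | (codePoint & 0x3F));        // 10 011101   -> 0x9D
        return new byte[] {b1, b2, b3};
    }

    public static void main(String[] args) {
        byte[] b = encodeThreeByte(0x665D);
        // Mask with 0xFF so each byte prints as an unsigned hex value.
        System.out.println(String.format("%02X %02X %02X",
                b[0] & 0xFF, b[1] & 0xFF, b[2] & 0xFF)); // E6 99 9D
    }
}
```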
According to this converter, U+665D is E6 99 9D in UTF-8 in hex, or 230 153 157 in decimal (which will be needed in a bit). Because a Java byte ranges from -128 to 127, values larger than 127 are displayed as the number less 256, so as bytes, 230 153 157 become 230-256, 153-256, 157-256, or -26 -103 -99, which is what you're seeing.
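The signed/unsigned relationship can be checked with a small sketch. The class name SignedByteDemo and the method toUnsigned are hypothetical; masking with 0xFF (or the standard Byte.toUnsignedInt) recovers the 0 to 255 value from a Java byte.

```java
class SignedByteDemo {
    // Hypothetical helper: reinterpret a signed byte as a value in 0..255.
    static int toUnsigned(byte b) {
        return b & 0xFF; // same result as Byte.toUnsignedInt(b)
    }

    public static void main(String[] args) {
        int[] unsignedValues = {230, 153, 157}; // 0xE6 0x99 0x9D
        for (int u : unsignedValues) {
            byte signed = (byte) u; // the narrowing cast wraps values above 127 by 256
            System.out.println(u + " -> " + signed + " -> " + toUnsigned(signed));
        }
        // Prints:
        // 230 -> -26 -> 230
        // 153 -> -103 -> 153
        // 157 -> -99 -> 157
    }
}
```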
Upvotes: 3