Senthil

Reputation: 35

Bytearray from string

I have converted a string containing a non-ASCII character, 晝, to a byte array. A byte can store values between -128 and 127, so the character has been stored as 3 bytes: -26, -103, -99.

Here's the conversion code:

String str = "晝"; 
byte[] b = str.getBytes(); 

for(byte bt : b) 
    System.out.println(bt); 

String str1 = new String(b);
System.out.println(str1);

Can you please clarify how these 3 bytes were calculated for this character?

Upvotes: 2

Views: 165

Answers (2)

Ted Hopp

Reputation: 234807

All conversions from characters to bytes use some character set to do the encoding.

You don't say, but I assume that you did the conversion using String.getBytes(). This is simply a shortcut for String.getBytes(Charset.defaultCharset()), and the default Charset depends on your particular Java environment. The three values you report are (in hex) 0xE6 0x99 0x9D, which is the UTF-8 encoding of U+665D (Unicode Han Character 'daytime, daylight'). Since that's the character that you report having started with, presumably the default character set for your environment is UTF-8 (which is not a surprise, but not something you can count on everywhere).
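To avoid depending on the environment, you can pass the charset explicitly. Here is a minimal sketch of that idea; the class name ExplicitCharsetDemo and the helper toHex are made up for illustration, and only getBytes(Charset), Charset.defaultCharset() and StandardCharsets come from the JDK:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Hypothetical helper: encode s with the given charset and
    // format each resulting byte as two upper-case hex digits.
    static String toHex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to 0..255 first
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "\u665D"; // the character from the question
        // Whatever getBytes() with no argument would use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());
        // The same string gives different bytes under different encodings.
        System.out.println("UTF-8:    " + toHex(str, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE: " + toHex(str, StandardCharsets.UTF_16BE));
    }
}
```

With UTF-8 this should give E6 99 9D, and with UTF-16BE the two bytes 66 5D, regardless of what the platform default happens to be.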

Upvotes: 1

blm

Reputation: 2446

晝 is U+665D. It looks like when you converted it, you converted it to UTF-8. UTF-8 is a variable length encoding of Unicode characters. Characters in [U+0800, U+FFFF] are converted to 3 bytes.
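To make the 3-byte case concrete, here is a rough sketch of how those bytes can be computed by hand. It assumes the standard 3-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx; the class and method names are hypothetical, and in real code you would just call getBytes(StandardCharsets.UTF_8):

```java
public class Utf8ThreeByteDemo {
    // Hypothetical helper: encode one code point in [U+0800, U+FFFF]
    // using the 3-byte layout 1110xxxx 10xxxxxx 10xxxxxx.
    // Surrogates (U+D800..U+DFFF) are rejected because they are not characters.
    static byte[] encodeThreeByte(int cp) {
        if (cp < 0x0800 || cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a 3-byte UTF-8 code point");
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),         // 1110 + top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)), // 10 + middle 6 bits
            (byte) (0x80 | (cp & 0x3F))         // 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        // U+665D is 0110 011001 011101 in binary (4 + 6 + 6 bits).
        for (byte b : encodeThreeByte(0x665D)) {
            System.out.println(b); // prints -26, -103, -99 on separate lines
        }
    }
}
```

The 16 bits of the code point are split 4 + 6 + 6, and each group gets the fixed prefix bits of its byte.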

According to this converter, U+665D is E6 99 9D in UTF-8 (in hex; that is 230 153 157 in decimal, which will be needed in a bit). Because a Java byte is signed and ranges from -128 to 127, values larger than 127 are displayed as the number less 256, so as bytes, 230 153 157 become 230-256, 153-256, 157-256, or -26 -103 -99, which is what you're seeing.
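If you want to see the 0 to 255 values rather than the signed ones, you can mask each byte with 0xFF. A small sketch (the class name SignedByteDemo and the helper toUnsigned are made up here; Byte.toUnsignedInt in Java 8+ does the same thing):

```java
public class SignedByteDemo {
    // Hypothetical helper: reinterpret a signed Java byte as 0..255.
    // Equivalent to Byte.toUnsignedInt(b) on Java 8 and later.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xE6, (byte) 0x99, (byte) 0x9D };
        for (byte b : bytes) {
            int u = toUnsigned(b);
            // signed value, unsigned value, and hex form of the same 8 bits
            System.out.println(b + " -> " + u + " -> 0x" + Integer.toHexString(u).toUpperCase());
        }
    }
}
```

The bits stored in the array are the same either way; only the interpretation when printing differs.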

Upvotes: 3
