Reputation: 18612
I ran into something tricky and can't understand exactly how it happens.
Why can a string that contains a single character produce byte arrays of different lengths?
Code:
import java.util.Arrays;

public class Application {
    public static void main(String[] args) throws Exception {
        char ch;
        // Code point below 0x80
        ch = 0x0001;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
        // Code point between 0x80 and 0x7FF
        ch = 0x0111;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
        // Code point between 0x800 and 0xFFFF
        ch = 0x1111;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
    }
}
The output is:
[1]
[-60, -111]
[-31, -124, -111]
Why exactly does this happen?
Upvotes: 0
Views: 47
Reputation: 336478
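That's how UTF-8 works. Code points between 0 and 127 are encoded as single-byte values (to maintain ASCII compatibility); code points above that are encoded as two- to four-byte values (the original UTF-8 design allowed sequences of up to six bytes, but RFC 3629 limits them to four).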
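As a rough illustration of the byte-length rule, here is a minimal sketch of a hand-written encoder. The class name `Utf8Sketch` and the helper `encode` are made up for this example and are not part of any library. The sketch does no validation (for instance, it does not reject surrogate code points), so treat it as a demonstration rather than a replacement for `String.getBytes`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Sketch {
    // Hypothetical helper: encodes one Unicode code point by hand.
    // Assumes cp is a valid, non-surrogate code point; no checks are done.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            // 0xxxxxxx: 7 payload bits, identical to ASCII
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            // 110xxxxx 10xxxxxx: 11 payload bits
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else if (cp < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 21 payload bits
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x0001, 0x0111, 0x1111 }) {
            byte[] manual = encode(cp);
            // Compare against the JDK's own encoder
            byte[] builtin = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.toString(manual)
                    + " matches built-in: " + Arrays.equals(manual, builtin));
        }
    }
}
```

For the three code points from the question, the hand-built arrays should match what `getBytes` returns: one, two, and three bytes respectively.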
So, for your examples:
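0x0001 (0b00000001) is encoded as 00000001 = (dec) 1
0x0111 (0b00000001 00010001) is encoded as 11000100 10010001 = (dec) -60 -111
0x1111 (0b00010001 00010001) is encoded as 11100001 10000100 10010001 = (dec) -31 -124 -111
The decimal values are negative because Java's byte type is signed: any byte whose high bit is set (128 to 255 unsigned) prints as that value minus 256.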
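To check the bit patterns yourself, a small sketch like the following prints each byte of the encoded string as a signed value, an unsigned value, and eight binary digits. The class name `ByteBits` and the helper `toBits` are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

public class ByteBits {
    // Hypothetical helper: renders a byte as exactly 8 binary digits.
    static String toBits(byte b) {
        // Mask to 0..255, set a ninth bit so leading zeros survive,
        // then drop that ninth bit from the string.
        return Integer.toBinaryString((b & 0xFF) | 0x100).substring(1);
    }

    public static void main(String[] args) {
        byte[] bytes = "\u1111".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            // signed value, unsigned value, bit pattern
            System.out.println(b + " " + (b & 0xFF) + " " + toBits(b));
        }
        // prints:
        // -31 225 11100001
        // -124 132 10000100
        // -111 145 10010001
    }
}
```

The leading bits `1110` mark the first byte of a three-byte sequence, and the leading `10` marks each continuation byte.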
Upvotes: 2