Weird encoding output for strings of the same length

Question

I ran into something tricky and can't understand exactly how it happens.

Why can a string that contains a single character produce byte arrays of different lengths?

Code:

import java.util.Arrays;

public class Application {
    public static void main(String[] args) throws Exception {

        char ch;
        ch = 0x0001;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
        ch = 0x0111;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
        ch = 0x1111;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
    }
}

The output is:

[1]
[-60, -111]
[-31, -124, -111]

Why exactly does this happen?

Tim Pietzcker · Accepted Answer

That's how UTF-8 works. Code points from 0 to 127 are encoded as a single byte (to maintain ASCII compatibility); code points above that are encoded as two to four bytes. (The original UTF-8 design allowed sequences of up to six bytes, but RFC 3629 restricts it to four.)
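As a rough check of those byte counts, the sketch below asks Java's own encoder how many bytes it produces for code points on either side of each length boundary. The class and method names (Utf8Lengths, utf8Length) are made up for illustration, and surrogate code points are deliberately left out of the samples because they cannot be encoded on their own.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Returns how many bytes Java's UTF-8 encoder emits for one code point.
    // Assumes codePoint is a valid, non-surrogate Unicode code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Values just below and at each boundary where the length changes.
        int[] samples = {0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

A Java char can only hold values up to 0xFFFF, so a single char never needs more than three bytes; the four-byte form only shows up for code points that Java stores as a surrogate pair.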

[Image: Wikipedia screenshot]

So, for your examples:

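0x0001 (0b00000000 00000001) is encoded as
(bin) 00000001 = (dec) 1
0x0111 (0b00000001 00010001) is encoded as
(bin) 11000100 10010001 = (dec) -60 -111
0x1111 (0b00010001 00010001) is encoded as
(bin) 11100001 10000100 10010001 = (dec) -31 -124 -111

The values print as negative numbers because Java's byte type is signed: any byte with its high bit set (128 to 255 unsigned) shows up as a negative value, so 0xC4 (196) appears as -60.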
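The bit layouts can also be reproduced by hand. The following is a minimal sketch, not how the JDK implements it: the class Utf8ByHand and its encode method are hypothetical names, and the method assumes its argument is a single non-surrogate char, so it only covers the one- to three-byte forms.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    // Encodes one char into UTF-8 manually.
    // Assumption: ch is not a surrogate (surrogate pairs need the 4-byte form).
    static byte[] encode(char ch) {
        if (ch < 0x80) {
            // 0xxxxxxx: 7 payload bits, identical to ASCII
            return new byte[] {(byte) ch};
        } else if (ch < 0x800) {
            // 110xxxxx 10xxxxxx: 5 + 6 payload bits
            return new byte[] {
                (byte) (0xC0 | (ch >> 6)),
                (byte) (0x80 | (ch & 0x3F))
            };
        } else {
            // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
            return new byte[] {
                (byte) (0xE0 | (ch >> 12)),
                (byte) (0x80 | ((ch >> 6) & 0x3F)),
                (byte) (0x80 | (ch & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        for (char ch : new char[] {0x0001, 0x0111, 0x1111}) {
            byte[] manual = encode(ch);
            byte[] builtIn = String.valueOf(ch).getBytes(StandardCharsets.UTF_8);
            // Mask with 0xFF to show each signed byte as an unsigned hex value.
            StringBuilder hex = new StringBuilder();
            for (byte b : manual) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(Arrays.toString(manual) + " = " + hex.toString().trim()
                    + ", matches getBytes: " + Arrays.equals(manual, builtIn));
        }
    }
}
```

Masking with `b & 0xFF` is the usual way to view a Java byte as an unsigned value, which makes the lead bytes C4 and E1 easier to compare with the bit patterns.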
