Misunderstanding encodings in Java

Question

I can't understand one tricky point about encoding.

Why is it that when a string is doubled in length, the length of its encoded byte array grows only 1.5 times?

Code:

public class Appl {
    public static void main(String[] args) throws Exception {

        System.out.println("A".getBytes("UTF-16").length);
        System.out.println("AA".getBytes("UTF-16").length);
    }
}

Output will be:

4
6

This may look a little silly, but I couldn't figure out why this happens.

Any suggestions?

Rohit Jain · Accepted Answer

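The UTF-16 encoding uses an optional byte-order mark (BOM) to identify the byte order. When Java encodes with the plain "UTF-16" charset, it writes a 2-byte BOM before the character data, so every result is 2 bytes longer than the characters alone: "A" is 2 + 2 = 4 bytes and "AA" is 2 + 4 = 6 bytes. See the Charset class documentation for details on the different charsets.

If you use, for example, UTF-16BE (big-endian) instead, no BOM is written and you get the expected result: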

System.out.println("A".getBytes("UTF-16BE").length);   // 2 (2 + 2 with UTF-16)
System.out.println("AA".getBytes("UTF-16BE").length);  // 4 (2 + 4 with UTF-16)
System.out.println("AAA".getBytes("UTF-16BE").length); // 6 (2 + 6 with UTF-16)
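
To see the extra two bytes directly, you can dump the encoded bytes as hex. This is only a minimal sketch: the class name BomDemo and the hex helper are made up for illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Hypothetical helper: formats a byte array as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so bytes above 0x7F are not printed as negative values.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Plain UTF-16: Java prepends the big-endian BOM (FE FF) when encoding.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // Explicit byte order: no BOM is written.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```

Using the StandardCharsets constants instead of a charset name string also means getBytes no longer throws a checked exception, so main does not need "throws Exception".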

