Reputation: 18572
I can't understand one tricky point about encoding.
Why is it that when a string's length doubles, its encoded byte length grows only by a factor of 1.5?
Code:
public class Appl {
    public static void main(String[] args) throws Exception {
        System.out.println("A".getBytes("UTF-16").length);
        System.out.println("AA".getBytes("UTF-16").length);
    }
}
Output will be:
4
6
This may look a little silly, but I couldn't figure out why this happens.
Any suggestions?
Upvotes: 1
Views: 64
Reputation: 18793
The first two bytes are the byte order mark (BOM); see http://en.wikipedia.org/wiki/Byte_Order_Mark. After that, each Java character (char) takes up two bytes, so a string of n chars encodes to 2 + 2n bytes: 4 for "A" and 6 for "AA". Java uses UTF-16 internally, and Unicode code points above U+FFFF are encoded as two Java characters (a surrogate pair).
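As a rough illustration of the "two Java characters" case, the sketch below encodes U+1F600, a code point outside the Basic Multilingual Plane chosen purely as an example. The class and method names (SurrogateDemo, utf16Length) are made up for this sketch.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // Number of bytes produced by the "UTF-16" charset, which prepends a 2-byte BOM.
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16).length;
    }

    public static void main(String[] args) {
        // U+1F600 written as its surrogate pair (example code point, an assumption of this sketch).
        String supplementary = "\uD83D\uDE00";

        // One code point, but two Java chars.
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
        System.out.println(supplementary.length());                                  // 2

        // 2 bytes of BOM + 2 bytes per char.
        System.out.println(utf16Length(supplementary)); // 6
        System.out.println(utf16Length("A"));           // 4
    }
}
```

So the byte count follows the number of chars, not the number of code points.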
To see in detail what is going on, print the byte array using Arrays.toString(...). The Unicode code point for 'A' is 65, so each 'A' shows up as the byte pair 0, 65 after the byte order mark.
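A minimal sketch of that suggestion, assuming the standard Java "UTF-16" charset (which writes a big-endian byte order mark when encoding). The class and helper names (Utf16Dump, dump) are only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16Dump {
    // Encodes the text with the "UTF-16" charset and renders the raw bytes as text.
    static String dump(String text) {
        return Arrays.toString(text.getBytes(StandardCharsets.UTF_16));
    }

    public static void main(String[] args) {
        // -2 and -1 are the signed-byte forms of 0xFE and 0xFF, the byte order mark.
        System.out.println(dump("A"));  // [-2, -1, 0, 65]
        System.out.println(dump("AA")); // [-2, -1, 0, 65, 0, 65]
    }
}
```

The mark appears once per encoded array, which is why doubling the string adds only two bytes here.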
Upvotes: 0
Reputation: 213193
The UTF-16 encoding uses an optional byte-order mark to identify the byte order. See the Charset class for information on the different charsets.
If you use, for example, UTF-16BE (big-endian) instead, you will get the expected result:
System.out.println("A".getBytes("UTF-16BE").length); // 2 (2 + 2 with UTF-16)
System.out.println("AA".getBytes("UTF-16BE").length); // 4 (2 + 4 with UTF-16)
System.out.println("AAA".getBytes("UTF-16BE").length); // 6 (2 + 6 with UTF-16)
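For a side-by-side view, here is a small sketch that compares the three UTF-16 variants from StandardCharsets. The class and method names (Utf16Lengths, encodedLength) are hypothetical, picked just for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Lengths {
    // Byte length of the text when encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"A", "AA", "AAA"}) {
            // UTF-16 adds a 2-byte mark; the BE and LE variants write none.
            System.out.printf("%-3s UTF-16=%d UTF-16BE=%d UTF-16LE=%d%n",
                    s,
                    encodedLength(s, StandardCharsets.UTF_16),
                    encodedLength(s, StandardCharsets.UTF_16BE),
                    encodedLength(s, StandardCharsets.UTF_16LE));
        }
    }
}
```

Using the StandardCharsets constants instead of a charset name string also avoids the checked UnsupportedEncodingException that getBytes(String) declares.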
Upvotes: 2