Reputation: 121028
I have a really simple question: what is the minimum size (number of bytes) that the letter "A" is supposed to occupy in UTF-16 encoding? I am coding in Java, though that should be irrelevant.
I really thought this one was pretty simple: since UTF-16 encodes each character as either 2 or 4 bytes, and the letter A is a "simple" one, the answer should be two. But then:
System.out.println("A".getBytes(StandardCharsets.UTF_8).length); // prints 1, as expected
System.out.println("A".getBytes(StandardCharsets.UTF_16).length); // prints 4, I thought it would be 2
System.out.println("AB".getBytes(StandardCharsets.UTF_8).length);// prints 2 as expected
System.out.println("AB".getBytes(StandardCharsets.UTF_16).length); // prints 6, expected 4
Can someone bring some light here?
Upvotes: 1
Views: 87
Reputation: 533820
When you use UTF-16, the encoder needs to indicate whether the output is little endian or big endian. It does this with a byte order mark (BOM), the character \uFEFF, which takes two extra bytes at the start. A decoder that reads those bytes in the wrong order sees \uFFFE, which is not a valid character, so the byte order can be detected.
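You can see the BOM by dumping the raw bytes (a quick sketch, assuming the same StandardCharsets import as in the question):
for (byte b : "A".getBytes(StandardCharsets.UTF_16)) {
    System.out.printf("%02X ", b); // prints: FE FF 00 41 -- the BOM, then 'A'
}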
UTF-8 is written one byte at a time, so there is no byte order to keep track of.
If you use UTF-16BE or UTF-16LE, the charset name itself fixes the byte order, so no BOM is needed, as the sketch below shows.
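With an explicit byte order you get the lengths you originally expected (again assuming the imports from the question):
System.out.println("A".getBytes(StandardCharsets.UTF_16BE).length); // prints 2, no BOM
System.out.println("A".getBytes(StandardCharsets.UTF_16LE).length); // prints 2, no BOM
System.out.println("AB".getBytes(StandardCharsets.UTF_16BE).length); // prints 4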
Upvotes: 2