Reputation: 10045
I was playing with a code snippet from the accepted answer to this question. I simply added a byte array encoded as UTF-16, as follows:
import java.nio.charset.StandardCharsets;

final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);
final byte[] asBytes16 = s.getBytes(StandardCharsets.UTF_16);
chars has 2 elements, which means two 16-bit integers in Java (since the code point is outside of the BMP).
asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.
asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this Unicode character?
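Wrapped in a class so it can be run on its own (the class name is my own, arbitrary choice), this is a minimal sketch that reproduces the element counts I'm describing:

import java.nio.charset.StandardCharsets;

public class CodePointLengths {
    public static void main(String[] args) {
        final char[] chars = Character.toChars(0x1F701);
        final String s = new String(chars);
        final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);
        final byte[] asBytes16 = s.getBytes(StandardCharsets.UTF_16);

        System.out.println(chars.length);     // 2
        System.out.println(asBytes.length);   // 4
        System.out.println(asBytes16.length); // 6
    }
}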
Upvotes: 8
Views: 1898
Reputation: 595402
asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.
Actually no, the number of chars needed to represent a codepoint in Java has nothing to do with it. The number of bytes is directly related to the numeric value of the codepoint itself.
Codepoint U+1F701 (0x1F701) uses 17 bits (11111011100000001).
0x1F701 requires 4 bytes in UTF-8 (F0 9F 9C 81) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 3629.
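As a rough illustration of that bit distribution (not how the JDK implements it internally, just a sketch of the RFC 3629 layout for a 4-byte sequence; the class name is mine):

public class Utf8BitDistribution {
    public static void main(String[] args) {
        int cp = 0x1F701;
        byte[] utf8 = {
            (byte) (0xF0 | (cp >>> 18)),          // 11110xxx  leading byte
            (byte) (0x80 | ((cp >>> 12) & 0x3F)), // 10xxxxxx  continuation
            (byte) (0x80 | ((cp >>> 6) & 0x3F)),  // 10xxxxxx  continuation
            (byte) (0x80 | (cp & 0x3F))           // 10xxxxxx  continuation
        };
        for (byte b : utf8) {
            System.out.printf("%02X ", b); // prints F0 9F 9C 81
        }
    }
}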
asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this Unicode character?
Per the Java documentation for StandardCharsets:
UTF_16
public static final Charset UTF_16
Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
0x1F701 requires 4 bytes in UTF-16 (D8 3D DF 01) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 2781.
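A sketch of the RFC 2781 surrogate calculation for this code point (String.getBytes does this for you; the class name is mine):

public class Utf16Surrogates {
    public static void main(String[] args) {
        int cp = 0x1F701;
        int u = cp - 0x10000;                      // 0x0F701, a 20-bit value
        char high = (char) (0xD800 | (u >>> 10));  // top 10 bits -> D83D
        char low  = (char) (0xDC00 | (u & 0x3FF)); // low 10 bits -> DF01
        System.out.printf("%04X %04X%n", (int) high, (int) low); // D83D DF01
    }
}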
UTF-16 is endianness-dependent, unlike UTF-8, so StandardCharsets.UTF_16 includes a BOM to specify the actual byte order used in the byte array.
To avoid the BOM, use StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE as needed:
UTF_16BE
public static final Charset UTF_16BE
Sixteen-bit UCS Transformation Format, big-endian byte order
UTF_16LE
public static final Charset UTF_16LE
Sixteen-bit UCS Transformation Format, little-endian byte order
Since their byte order is implied by their names, they don't need to include a BOM in the byte array.
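A quick comparison of the three charsets (the class name is my own) shows exactly where the 2 extra bytes go:

import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 6: BOM + surrogate pair
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 4: no BOM
        System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length); // 4: no BOM
    }
}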
Upvotes: 3
Reputation: 44932
The UTF-16 bytes start with the byte order mark FEFF to indicate that the value is encoded in big-endian. As per the Wikipedia article, the BOM is also used to distinguish UTF-16 from UTF-8:
Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8.
You can convert a byte[] to a hex-encoded String as per this answer:
asBytes = F09F9C81
asBytes16 = FEFFD83DDF01
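For completeness, one possible helper that produces the dumps above (the linked answer may do it differently; toHex and the class name are my own):

import java.nio.charset.StandardCharsets;

public class HexDump {
    // Hex-encode each byte as two uppercase digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));  // F09F9C81
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16))); // FEFFD83DDF01
    }
}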
Upvotes: 5