Reputation: 10045
I was playing with a code snippet from the accepted answer to this question. I simply added a byte array encoded as UTF-16, as follows:
import java.nio.charset.StandardCharsets;

final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);
final byte[] asBytes16 = s.getBytes(StandardCharsets.UTF_16);
chars has 2 elements, which means two 16-bit integers in Java (since the code point is outside of the BMP).
asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.
asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this Unicode character?
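Wrapped in a class so it can be run on its own (the class name is my own, arbitrary choice), this is a minimal sketch that reproduces the element counts I'm describing:

import java.nio.charset.StandardCharsets;

public class CodePointLengths {
    public static void main(String[] args) {
        final char[] chars = Character.toChars(0x1F701);
        final String s = new String(chars);
        final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);
        final byte[] asBytes16 = s.getBytes(StandardCharsets.UTF_16);

        System.out.println(chars.length);     // 2
        System.out.println(asBytes.length);   // 4
        System.out.println(asBytes16.length); // 6
    }
}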
Upvotes: 8
Views: 1898
Reputation: 595402
asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.
Actually no, the number of chars needed to represent a codepoint in Java has nothing to do with it. The number of bytes is directly related to the numeric value of the codepoint itself.
Codepoint U+1F701 (0x1F701) uses 17 bits (11111011100000001).
0x1F701 requires 4 bytes in UTF-8 (F0 9F 9C 81) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 3629.
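As a rough illustration of that bit distribution (not how the JDK implements it internally, just a sketch of the RFC 3629 layout for a 4-byte sequence; the class name is mine):

public class Utf8BitDistribution {
    public static void main(String[] args) {
        int cp = 0x1F701;
        byte[] utf8 = {
            (byte) (0xF0 | (cp >>> 18)),          // 11110xxx  leading byte
            (byte) (0x80 | ((cp >>> 12) & 0x3F)), // 10xxxxxx  continuation
            (byte) (0x80 | ((cp >>> 6) & 0x3F)),  // 10xxxxxx  continuation
            (byte) (0x80 | (cp & 0x3F))           // 10xxxxxx  continuation
        };
        for (byte b : utf8) {
            System.out.printf("%02X ", b); // prints F0 9F 9C 81
        }
    }
}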
asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this Unicode character?
Per the Java documentation for StandardCharsets:
UTF_16
public static final Charset UTF_16
Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
0x1F701 requires 4 bytes in UTF-16 (D8 3D DF 01) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 2781.
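A sketch of the RFC 2781 surrogate calculation for this code point (String.getBytes does this for you; the class name is mine):

public class Utf16Surrogates {
    public static void main(String[] args) {
        int cp = 0x1F701;
        int u = cp - 0x10000;                      // 0x0F701, a 20-bit value
        char high = (char) (0xD800 | (u >>> 10));  // top 10 bits -> D83D
        char low  = (char) (0xDC00 | (u & 0x3FF)); // low 10 bits -> DF01
        System.out.printf("%04X %04X%n", (int) high, (int) low); // D83D DF01
    }
}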
UTF-16 is endianness-dependent, unlike UTF-8, so StandardCharsets.UTF_16 includes a BOM to specify the actual byte order used in the byte array.
To avoid the BOM, use StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE as needed:
UTF_16BE
public static final Charset UTF_16BE
Sixteen-bit UCS Transformation Format, big-endian byte order
UTF_16LE
public static final Charset UTF_16LE
Sixteen-bit UCS Transformation Format, little-endian byte order
Since their byte order is implied by their names, they don't need to include a BOM in the byte array.
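A quick comparison of the three charsets (the class name is my own) shows exactly where the 2 extra bytes go:

import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 6: BOM + surrogate pair
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 4: no BOM
        System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length); // 4: no BOM
    }
}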
Upvotes: 3
Reputation: 44932
The UTF-16 bytes start with the byte order mark FEFF to indicate that the value is encoded in big-endian. As per the Wikipedia article, the BOM is also used to distinguish UTF-16 from UTF-8:
Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8.
You can convert a byte[] to a hex-encoded String as per this answer:
asBytes = F09F9C81
asBytes16 = FEFFD83DDF01
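For completeness, one possible helper that produces the dumps above (the linked answer may do it differently; toHex and the class name are my own):

import java.nio.charset.StandardCharsets;

public class HexDump {
    // Hex-encode each byte as two uppercase digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));  // F09F9C81
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16))); // FEFFD83DDF01
    }
}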
Upvotes: 5