bopcat

Reputation: 478

Null characters added when a string is encoded into UTF-8 bytes?

The code snippet:

import java.nio.charset.Charset;
import java.util.Arrays;

public static void main(String[] args) {
    String s = "qwertyuiop";
    System.out.println(Arrays.toString(Charset
       .forName("UTF-8")
       .encode(s)
       .array()));
}

Prints:

[113, 119, 101, 114, 116, 121, 117, 105, 111, 112, 0]

That seems to happen because, under the hood, the averageBytesPerChar value for UTF-8 inside the java.nio.charset.CharsetEncoder class appears to be 1.1. Hence it allocates 11 bytes instead of 10, and, provided the input string contains only good old single-byte chars, I get that odd null character at the end.
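That over-allocation is easy to observe without digging into the encoder internals, by printing the returned buffer's limit and capacity (a minimal sketch; the class name is mine, and the exact capacity may differ across JDK versions):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    public static void main(String[] args) {
        String s = "qwertyuiop";
        // encode() returns a ByteBuffer whose backing array may be larger
        // than the encoded content: limit is the number of encoded bytes,
        // capacity is the size of the allocated backing array
        ByteBuffer buf = StandardCharsets.UTF_8.encode(s);
        System.out.println("limit=" + buf.limit() + " capacity=" + buf.capacity());
    }
}
```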

I wonder if this is documented anywhere?

This page:

https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html#encode(java.lang.String)

doesn't give a clue about this behaviour.

P. S. Am I right that, in any case, the snippet above would be better replaced by:

s.getBytes(StandardCharsets.UTF_8)

which, as I see from its source, also trims the result to avoid those null chars?

Then what is java.nio.charset.Charset's encode(String s) supposed to be for?
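For reference, a quick check (class name mine) confirms that getBytes produces exactly 10 bytes for this string, with no trailing zero:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    public static void main(String[] args) {
        // getBytes returns an array sized exactly to the encoded content
        byte[] bytes = "qwertyuiop".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);           // 10
        System.out.println(Arrays.toString(bytes)); // no trailing 0
    }
}
```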

Upvotes: 3

Views: 6641

Answers (2)

Michael Gantman

Reputation: 7790

If you just want to see the byte array of your String encoded as UTF-8, then simply use the getBytes(Charset charset) method. It may look something like this:

String hello = "qwertyuiop";   
byte[] helloBytes_UTF_8 = hello.getBytes(StandardCharsets.UTF_8);

For this string, which contains only ASCII characters, UTF-8 produces one byte per character, so the array will hold exactly 10 bytes; the byte values in your output are the same as they would be under StandardCharsets.ISO_8859_1, since the two encodings agree on this range. If you want to play more with different encodings, then I would recommend using a small open-source library with some utils, one of which allows you to convert Strings into their Unicode (UTF-8) escape representation and back: Open Source Java library with stack trace filtering, Silent String parsing, Unicode converter and Version comparison. That article describes the library and how to use it; you can download the sources and javadoc as well. In particular, look for the paragraph "String Unicode converter". Using this class, your String "qwertyuiop" will be converted into: "\u0071\u0077\u0065\u0072\u0074\u0079\u0075\u0069\u006f\u0070". Each group of four hex digits after the \u represents one char (two bytes in Java's internal UTF-16 representation).
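You don't strictly need a library for the escape form shown above; a minimal sketch of such a converter (helper name mine, not the library's API) looks like:

```java
public class UnicodeEscaper {
    // Convert each char of the string to its \uXXXX escape form.
    // Works per UTF-16 code unit, which matches the escapes shown above.
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicodeEscapes("qwertyuiop"));
        // \u0071\u0077\u0065\u0072\u0074\u0079\u0075\u0069\u006f\u0070
    }
}
```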

Upvotes: 0

kennytm

Reputation: 523294

The problem is not in Charset.encode(), but in Buffer.array().

If you print Charset.forName("UTF-8").encode(s), you will find the output to be

java.nio.HeapByteBuffer[pos=0 lim=10 cap=11]

The ByteBuffer has limit 10, the length of the string, and capacity 11, the total allocated size of the buffer. If you change the encoding, the limit and capacity may have even wilder variation, e.g.

System.out.println(Charset.forName("UTF-16").encode(s));
// java.nio.HeapByteBuffer[pos=0 lim=22 cap=41]
// (2 extra bytes because of the BOM, not null-termination)

When you call .array(), it will return the whole backing array, so even stuff beyond the limit will be included.

The actual method to extract a Java byte array is through the .get() method:

ByteBuffer buf = Charset.forName("UTF-8").encode(s);
byte[] encoded = new byte[buf.limit()];
buf.get(encoded);
System.out.println(Arrays.toString(encoded));

Well, this looks like a mess? That's because "nio" stands for New I/O, designed around efficient native I/O. The Buffer type is created so that it can easily wrap a C array. It makes interacting with native code, such as reading/writing files or sending/receiving network data, very efficient. These NIO APIs typically take a Buffer directly, without constructing any byte[] in between. If you are only working with Buffers, the middle two lines do not need to exist :).
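For example, a channel consumes the buffer directly and honours its position/limit, so only the 10 encoded bytes reach the file (a sketch using a temp file; class name mine):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NioWriteDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("nio", ".txt");
        tmp.toFile().deleteOnExit();

        ByteBuffer buf = StandardCharsets.UTF_8.encode("qwertyuiop");
        // write() transfers bytes between position and limit only,
        // so the extra capacity byte is never written -- no byte[] copy needed
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ch.write(buf);
        }
        System.out.println(Files.size(tmp)); // 10
    }
}
```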

If the whole operation stays within Java, yes just call s.getBytes(StandardCharsets.UTF_8).

Upvotes: 12
