bopcat

Reputation: 478

Null characters added when a string is encoded into UTF-8 bytes?

The code snippet:

import java.nio.charset.Charset;
import java.util.Arrays;

public static void main(String[] args) {
    String s = "qwertyuiop";
    System.out.println(Arrays.toString(Charset
       .forName("UTF-8")
       .encode(s)
       .array()));
}

Prints:

[113, 119, 101, 114, 116, 121, 117, 105, 111, 112, 0]

That seems to happen because, under the hood, the averageBytesPerChar value for UTF-8 inside the java.nio.charset.CharsetEncoder class appears to be 1.1. Hence it allocates 11 bytes instead of 10, and, provided the input string contains only good old single-byte chars, I get that odd null character at the end.
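That over-allocation is easy to observe without digging into the encoder internals, by printing the returned buffer's limit and capacity (a minimal sketch; the class name is mine, and the exact capacity may differ across JDK versions):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    public static void main(String[] args) {
        String s = "qwertyuiop";
        // encode() returns a ByteBuffer whose backing array may be larger
        // than the encoded content: limit is the number of encoded bytes,
        // capacity is the size of the allocated backing array
        ByteBuffer buf = StandardCharsets.UTF_8.encode(s);
        System.out.println("limit=" + buf.limit() + " capacity=" + buf.capacity());
    }
}
```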

I wonder if this is documented anywhere?

This page:

https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html#encode(java.lang.String)

doesn't give a clue about this behaviour.

P. S. Am I right that, in any case, the snippet above would be better replaced by:

s.getBytes(StandardCharsets.UTF_8)

which, as I see from its source, also trims the result to avoid those null chars?

Then what is java.nio.charset.Charset's encode(String s) supposed to be for?
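For reference, a quick check (class name mine) confirms that getBytes produces exactly 10 bytes for this string, with no trailing zero:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    public static void main(String[] args) {
        // getBytes returns an array sized exactly to the encoded content
        byte[] bytes = "qwertyuiop".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);           // 10
        System.out.println(Arrays.toString(bytes)); // no trailing 0
    }
}
```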

Upvotes: 3

Views: 6641

Answers (2)

Michael Gantman

Reputation: 7790

If you just want to see the byte array of your String encoded as UTF-8, then simply use the getBytes(Charset charset) method. It may look something like this:

String hello = "qwertyuiop";   
byte[] helloBytes_UTF_8 = hello.getBytes(StandardCharsets.UTF_8);

For this string, which contains only ASCII characters, UTF-8 produces one byte per character, so the array will hold exactly 10 bytes; the byte values in your output are the same as they would be under StandardCharsets.ISO_8859_1, since the two encodings agree on this range. If you want to play more with different encodings, then I would recommend using a small open-source library with some utils, one of which allows you to convert Strings into their Unicode (UTF-8) escape representation and back: Open Source Java library with stack trace filtering, Silent String parsing, Unicode converter and Version comparison. That article describes the library and how to use it; you can download the sources and javadoc as well. In particular, look for the paragraph "String Unicode converter". Using this class, your String "qwertyuiop" will be converted into: "\u0071\u0077\u0065\u0072\u0074\u0079\u0075\u0069\u006f\u0070". Each group of four hex digits after the \u represents one char (two bytes in Java's internal UTF-16 representation).
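You don't strictly need a library for the escape form shown above; a minimal sketch of such a converter (helper name mine, not the library's API) looks like:

```java
public class UnicodeEscaper {
    // Convert each char of the string to its \uXXXX escape form.
    // Works per UTF-16 code unit, which matches the escapes shown above.
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicodeEscapes("qwertyuiop"));
        // \u0071\u0077\u0065\u0072\u0074\u0079\u0075\u0069\u006f\u0070
    }
}
```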

Upvotes: 0

kennytm

Reputation: 523294

The problem is not in Charset.encode(), but in Buffer.array().

If you print Charset.forName("UTF-8").encode(s), you will find the output to be

java.nio.HeapByteBuffer[pos=0 lim=10 cap=11]

The ByteBuffer has limit 10, the length of the string, and capacity 11, the total allocated size of the buffer. If you change the encoding, the limit and capacity may have even wilder variation, e.g.

System.out.println(Charset.forName("UTF-16").encode(s));
// java.nio.HeapByteBuffer[pos=0 lim=22 cap=41]
// (2 extra bytes because of the BOM, not null-termination)

When you call .array(), it will return the whole backing array, so even stuff beyond the limit will be included.

The actual method to extract a Java byte array is through the .get() method:

ByteBuffer buf = Charset.forName("UTF-8").encode(s);
byte[] encoded = new byte[buf.limit()];
buf.get(encoded);
System.out.println(Arrays.toString(encoded));

Well, this looks like a mess? That's because "nio" stands for New I/O, designed around efficient native I/O. The Buffer type is created so that it can easily wrap a C array. It makes interacting with native code, such as reading/writing files or sending/receiving network data, very efficient. These NIO APIs typically take a Buffer directly, without constructing any byte[] in between. If you are only working with Buffers, the middle two lines do not need to exist :).
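For example, a channel consumes the buffer directly and honours its position/limit, so only the 10 encoded bytes reach the file (a sketch using a temp file; class name mine):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NioWriteDemo {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("nio", ".txt");
        tmp.toFile().deleteOnExit();

        ByteBuffer buf = StandardCharsets.UTF_8.encode("qwertyuiop");
        // write() transfers bytes between position and limit only,
        // so the extra capacity byte is never written -- no byte[] copy needed
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            ch.write(buf);
        }
        System.out.println(Files.size(tmp)); // 10
    }
}
```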

If the whole operation stays within Java, yes just call s.getBytes(StandardCharsets.UTF_8).

Upvotes: 12
