Maarten Bodewes

Reputation: 93978

Is a Java char array always a valid UTF-16 (Big Endian) encoding?

Say that I would encode a Java character array (char[]) instance as bytes:
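
For concreteness, a sketch of the idea: write each char as two bytes, high byte first (here chars stands for the char[] to encode):

byte[] bytes = new byte[chars.length * 2];
for (int i = 0; i < chars.length; i++) {
    bytes[2 * i]     = (byte) (chars[i] >>> 8); // most significant byte first (big endian)
    bytes[2 * i + 1] = (byte) chars[i];         // least significant byte
}
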

Would this always create a valid UTF-16BE encoding? If not, which code points will result in an invalid encoding?


This question is very much related to this question about the Java char type and this question about the internal representation of Java strings.

Upvotes: 7

Views: 2119

Answers (1)

一二三

Reputation: 21249

No. You can create char instances that contain any 16-bit value you desire; nothing constrains them to be valid UTF-16 code units, nor an array of them to be a valid UTF-16 sequence. Even String does not require that its data be valid UTF-16:

char data[] = {'\uD800', 'b', 'c'};  // Unpaired lead surrogate
String str = new String(data);

The requirements for valid UTF-16 data are set out in Chapter 3 of the Unicode Standard (basically, everything must be a Unicode scalar value, and all surrogates must be correctly paired). You can test if a char array is a valid UTF-16 sequence, and turn it into a sequence of UTF-16BE (or LE) bytes, by using a CharsetEncoder:

CharsetEncoder encoder = Charset.forName("UTF-16BE").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(data)); // throws MalformedInputException
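
When encoding succeeds, a plain byte[] can be copied out of the resulting ByteBuffer, for example (a sketch; the name encoded is arbitrary):

byte[] encoded = new byte[bytes.remaining()];
bytes.get(encoded); // encoded now holds the UTF-16BE bytes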

(And similarly using a CharsetDecoder if you have bytes.)
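
For instance, the decoding direction might look like this (a sketch; encodedBytes is assumed to be whatever UTF-16BE bytes you want to check):

CharsetDecoder decoder = Charset.forName("UTF-16BE").newDecoder();
CharBuffer chars = decoder.decode(ByteBuffer.wrap(encodedBytes)); // throws MalformedInputException on ill-formed input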

Upvotes: 9
