Reputation: 4755
For any given Java String s, I would like to know if the array of characters represented by s is guaranteed to be a valid UTF-16 string, e.g.:
final char[] ch = new char[s.length()];
for (int i = 0; i < ch.length; ++i) {
ch[i] = s.charAt(i);
}
// Is ch guaranteed to be a valid UTF-16 encoded string?
If not, what are some simple Java-language test cases that produce invalid UTF-16?
EDIT: Somebody has flagged the question as a possible duplicate of Is a Java char array always a valid UTF-16 (Big Endian) encoding? All I can say is, there's a difference between a String and a char[], and a reason why the former might, at least theoretically, have guarantees as to its contents that the latter does not. I'm not asking a question about arrays, I'm asking a question about Strings.
Upvotes: 2
Views: 1487
Reputation: 21249
No. A String is simply an unrestricted wrapper for a char[]:
char[] data = {'\uD800', 'b', 'c'}; // unpaired lead surrogate
String str = new String(data);
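You can also reach the same ill-formed state without ever typing a surrogate literal: since substring operates on char indices rather than code points, it can split a surrogate pair. A minimal sketch (the example string and variable names are mine, not from the answer):

String emoji = "\uD83D\uDE00";          // U+1F600, stored as a surrogate pair
String broken = emoji.substring(0, 1);  // keeps only the lead surrogate \uD83D
System.out.println(broken.length());    // prints 1 -- a lone surrogate, ill-formed UTF-16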
To test a String or char[] for well-formed UTF-16 data, you can use CharsetEncoder:
CharsetEncoder encoder = Charset.forName("UTF-16LE").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(str)); // throws MalformedInputException on unpaired surrogates
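For a reusable check, here is a self-contained sketch (the class and method names are mine, and I use StandardCharsets rather than Charset.forName; a fresh encoder's malformed-input action defaults to REPORT, which is what makes encode() throw on unpaired surrogates):

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Utf16Check {
    // Returns true iff s contains no unpaired surrogates.
    static boolean isWellFormedUtf16(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_16LE.newEncoder();
        try {
            encoder.encode(CharBuffer.wrap(s)); // throws CharacterCodingException on malformed input
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("abc"));      // true
        System.out.println(isWellFormedUtf16("\uD800bc")); // false: lone lead surrogate
    }
}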
Upvotes: 5
Reputation: 80404
No, an instance of a Java String is not guaranteed to contain well-formed UTF-16 (that is, a sequence of 16-bit code units in which every surrogate is correctly paired) at all points during a program's execution. It really has to work this way, too.
This is trivial to prove. Imagine you have a sequence of code points (which are 21-bit quantities, typically stored in 32-bit ints) that you wish to append to a String, one char unit at a time. If some of those code points lie above the Basic Multilingual Plane (that is, have values > 0xFFFF, and so require more than 16 bits to hold them), then when adding 16-bit code units one at a time, there will be a point at which the String holds a leading surrogate but not yet the required trailing surrogate.
In other words, a String works more like a buffer of 16-bit char units than like a guaranteed-legal UTF-16 sequence. This really is a necessary aspect of the String type.
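A short sketch of that intermediate state, assuming we build U+1F600 (whose UTF-16 form is the pair \uD83D \uDE00) one char at a time:

StringBuilder sb = new StringBuilder("Hi ");
sb.append('\uD83D');            // lead surrogate only: the buffer is now ill-formed UTF-16
String midway = sb.toString();  // a perfectly legal String containing a lone surrogate
sb.append('\uDE00');            // the trailing surrogate completes the pair
String done = sb.toString();    // well-formed again: "Hi " + U+1F600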
Only when converting this to a particular encoding would there be any trouble, since mismatched, reversed, or lone surrogates are not legal in any of the three UTF encoding forms (UTF-8, UTF-16, or UTF-32), and the encoder would therefore be unable to represent them.
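For example, String.getBytes(Charset) is documented to replace malformed input with the charset's default replacement byte array rather than throw; on the JDKs I have seen, UTF-8's replacement is a single '?':

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] utf8 = "\uD800".getBytes(StandardCharsets.UTF_8); // lone lead surrogate
System.out.println(Arrays.toString(utf8));               // [63], i.e. one '?' byte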
Upvotes: 5