Reputation: 4755
For any given Java String s, I would like to know if the array of characters represented by s is guaranteed to be a valid UTF-16 string, e.g.:
final char[] ch = new char[s.length()];
for (int i = 0; i < ch.length; ++i) {
ch[i] = s.charAt(i);
}
// Is ch guaranteed to be a valid UTF-16 encoded string?
If not, what are some simple Java-language test cases that produce invalid UTF-16?
EDIT: Somebody has flagged the question as a possible duplicate of Is a Java char array always a valid UTF-16 (Big Endian) encoding? All I can say is, there's a difference between a String and a char[], and a reason why the former might, at least theoretically, have guarantees as to its contents that the latter does not. I'm not asking a question about arrays, I'm asking a question about Strings.
Upvotes: 2
Views: 1487
Reputation: 21249
No. A String is simply an unrestricted wrapper for a char[]:
char[] data = {'\uD800', 'b', 'c'}; // unpaired lead surrogate
String str = new String(data);
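You can also reach the same ill-formed state without ever typing a surrogate literal: since substring operates on char indices rather than code points, it can split a surrogate pair. A minimal sketch (the example string and variable names are mine, not from the answer):

String emoji = "\uD83D\uDE00";          // U+1F600, stored as a surrogate pair
String broken = emoji.substring(0, 1);  // keeps only the lead surrogate \uD83D
System.out.println(broken.length());    // prints 1 -- a lone surrogate, ill-formed UTF-16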
To test a String or char[] for well-formed UTF-16 data, you can use CharsetEncoder:
CharsetEncoder encoder = Charset.forName("UTF-16LE").newEncoder();
ByteBuffer bytes = encoder.encode(CharBuffer.wrap(str)); // throws MalformedInputException on unpaired surrogates
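For a reusable check, here is a self-contained sketch (the class and method names are mine, and I use StandardCharsets rather than Charset.forName; a fresh encoder's malformed-input action defaults to REPORT, which is what makes encode() throw on unpaired surrogates):

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Utf16Check {
    // Returns true iff s contains no unpaired surrogates.
    static boolean isWellFormedUtf16(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_16LE.newEncoder();
        try {
            encoder.encode(CharBuffer.wrap(s)); // throws CharacterCodingException on malformed input
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("abc"));      // true
        System.out.println(isWellFormedUtf16("\uD800bc")); // false: lone lead surrogate
    }
}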
Upvotes: 5
Reputation: 80404
No, an instance of a Java String is not guaranteed to contain well-formed UTF-16 (that is, a sequence of 16-bit code units in which every surrogate is correctly paired) at all points during a program's execution. It really has to work this way, too.
This is trivial to prove. Imagine you have a sequence of code points (which are 21-bit quantities, typically stored in 32-bit ints) that you wish to append to a String, one char unit at a time. If some of those code points lie above the Basic Multilingual Plane (that is, have values > 0xFFFF, and so require more than 16 bits to hold them), then when adding 16-bit code units one at a time, there will be a point at which the String holds a leading surrogate but not yet the required trailing surrogate.
In other words, a String works more like a buffer of 16-bit char units than like a guaranteed-legal UTF-16 sequence. This really is a necessary aspect of the String type.
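A short sketch of that intermediate state, assuming we build U+1F600 (whose UTF-16 form is the pair \uD83D \uDE00) one char at a time:

StringBuilder sb = new StringBuilder("Hi ");
sb.append('\uD83D');            // lead surrogate only: the buffer is now ill-formed UTF-16
String midway = sb.toString();  // a perfectly legal String containing a lone surrogate
sb.append('\uDE00');            // the trailing surrogate completes the pair
String done = sb.toString();    // well-formed again: "Hi " + U+1F600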
Only when converting this to a particular encoding would there be any trouble, since mismatched, reversed, or lone surrogates are not legal in any of the three UTF encoding forms (UTF-8, UTF-16, or UTF-32), and the encoder would therefore be unable to represent them.
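For example, String.getBytes(Charset) is documented to replace malformed input with the charset's default replacement byte array rather than throw; on the JDKs I have seen, UTF-8's replacement is a single '?':

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] utf8 = "\uD800".getBytes(StandardCharsets.UTF_8); // lone lead surrogate
System.out.println(Arrays.toString(utf8));               // [63], i.e. one '?' byte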
Upvotes: 5