Reputation: 3757
I ran into something that surprised me a little. When I try to build a string from bytes that are not valid UTF-8, the String constructor still gives me a result; no exception is thrown. Example:
byte[] x = { (byte) 0xf0, (byte) 0xab };
new String(x, "UTF-8"); // This works, or at least gives a result
// This however, throws java.nio.charset.MalformedInputException: Input length = 3
ByteBuffer wrapped = ByteBuffer.wrap(x);
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.decode(wrapped);
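To see the two behaviors side by side, here is a self-contained version of the snippets above (the class name DecodeDemo is mine; the decoding calls are the same as in the question):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        byte[] x = { (byte) 0xf0, (byte) 0xab };

        // The constructor silently substitutes the replacement
        // character U+FFFD for the malformed sequence:
        String s = new String(x, "UTF-8");
        System.out.println(s.contains("\uFFFD")); // true

        // A fresh CharsetDecoder reports the malformed input instead:
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        try {
            decoder.decode(ByteBuffer.wrap(x));
        } catch (CharacterCodingException e) {
            System.out.println("caught: " + e.getClass().getSimpleName());
        }
    }
}
```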
Trying the same thing in, for example, Python also gives an error, with a somewhat clearer error message:
>>> '\xf0\xab'.decode('utf-8')
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: unexpected end of data
So why does the Java String constructor seem to ignore errors in the input?
Update: I should be a little clearer. The Javadoc points out that this behavior is unspecified, but what could the reason be for implementing it like this? It seems to me you would never want this sort of behavior, and any time you can't be 100% sure of the source you would need to use CharsetDecoder to be safe.
Upvotes: 1
Views: 319
Reputation: 78855
The Java documentation for String(byte[], String) says:
The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
The constructor String(byte[], Charset) behaves differently again:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.
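That "more control" amounts to configuring the decoder's error handling with CodingErrorAction. A short sketch (class name and printed labels are illustrative): REPLACE reproduces what the String(byte[], Charset) constructor does, while REPORT, the default for a fresh decoder, rejects the input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ControlDemo {
    public static void main(String[] args) throws CharacterCodingException {
        byte[] x = { (byte) 0xf0, (byte) 0xab };

        // REPLACE: substitute the charset's replacement string (U+FFFD),
        // mimicking the String(byte[], Charset) constructor.
        String replaced = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .decode(ByteBuffer.wrap(x))
                .toString();
        System.out.println(replaced.contains("\uFFFD")); // true

        // REPORT (the default): throw on malformed input.
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(x));
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
    }
}
```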
I like Python's behavior better, but you can't expect Java to behave exactly like Python.
Upvotes: 1