Reputation: 3757
I ran into something that surprised me a little. When I try to build a string from bytes that are not valid UTF-8, the String constructor still gives me a result; no exception is thrown. Example:
byte[] x = { (byte) 0xf0, (byte) 0xab };
new String(x, "UTF-8"); // This works, or at least gives a result
// This however, throws java.nio.charset.MalformedInputException: Input length = 3
ByteBuffer wrapped = ByteBuffer.wrap(x);
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.decode(wrapped);
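To see the two behaviors side by side, here is a self-contained version of the snippets above (the class name DecodeDemo is mine; the decoding calls are the same as in the question):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        byte[] x = { (byte) 0xf0, (byte) 0xab };

        // The constructor silently substitutes the replacement
        // character U+FFFD for the malformed sequence:
        String s = new String(x, "UTF-8");
        System.out.println(s.contains("\uFFFD")); // true

        // A fresh CharsetDecoder reports the malformed input instead:
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        try {
            decoder.decode(ByteBuffer.wrap(x));
        } catch (CharacterCodingException e) {
            System.out.println("caught: " + e.getClass().getSimpleName());
        }
    }
}
```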
Trying the same thing in, for example, Python also gives an error, with a somewhat clearer error message:
>>> '\xf0\xab'.decode('utf-8')
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: unexpected end of data
So why does the Java String constructor seem to ignore errors in the input?
Update: I should be a little clearer. The Javadoc points out that this behavior is unspecified, but what could the reason be for implementing it like this? It seems to me you would never want this sort of behavior, and any time you can't be 100% sure of the source you would need to use CharsetDecoder to be safe.
Upvotes: 1
Views: 319
Reputation: 78855
The Java documentation for String(byte[], String) says:
The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
The constructor String(byte[], Charset) behaves differently again:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.
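That "more control" amounts to configuring the decoder's error handling with CodingErrorAction. A short sketch (class name and printed labels are illustrative): REPLACE reproduces what the String(byte[], Charset) constructor does, while REPORT, the default for a fresh decoder, rejects the input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ControlDemo {
    public static void main(String[] args) throws CharacterCodingException {
        byte[] x = { (byte) 0xf0, (byte) 0xab };

        // REPLACE: substitute the charset's replacement string (U+FFFD),
        // mimicking the String(byte[], Charset) constructor.
        String replaced = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .decode(ByteBuffer.wrap(x))
                .toString();
        System.out.println(replaced.contains("\uFFFD")); // true

        // REPORT (the default): throw on malformed input.
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(x));
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e.getClass().getSimpleName());
        }
    }
}
```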
I like Python's behavior better, but you can't expect Java to behave exactly like Python.
Upvotes: 1