Reputation: 31605
I read a byte[] from a file and convert it to a String:
byte[] bytesFromFile = Files.readAllBytes(...);
String stringFromFile = new String(bytesFromFile, "UTF-8");
I want to compare this to another byte[] I get from a web service:
String stringFromWebService = webService.getMyByteString();
byte[] bytesFromWebService = stringFromWebService.getBytes("UTF-8");
So I read a byte[] from a file and convert it to a String, and I get a String from my web service and convert it to a byte[]. Then I do the following tests:
// works!
org.junit.Assert.assertEquals(stringFromFile, stringFromWebService);
// fails!
org.junit.Assert.assertArrayEquals(bytesFromFile, bytesFromWebService);
Why does the second assertion fail?
Upvotes: 7
Views: 2087
Reputation: 20344
Other answers have covered the likely fact that the file is not UTF-8 encoded, giving rise to the symptoms described.
However, I think the most interesting aspect of this is not that the byte[] assertion fails, but that the assertion that the string values are the same passes. I'm not 100% sure why this is, but I think the following trawl through the source code might give us the answer:
Following the source of new String(bytesFromFile, "UTF-8"), we see that the constructor calls through to StringCoding.decode(), which looks up the UTF-8 character set and calls through to StringDecoder.decode(), which in turn delegates to CharsetDecoder.decode(), which decides what to do if a character is unmappable (which I guess will be the case if a non-UTF-8 byte sequence is presented). A fresh decoder's default action is defined by
private CodingErrorAction unmappableCharacterAction
    = CodingErrorAction.REPORT;
but the String-constructor path configures its decoder to substitute rather than report, so a bad byte sequence is silently replaced instead of raising an error.
I think this means that even when the code gets an unmappable character, it substitutes its best guess (the replacement character, U+FFFD), so the String representations end up equal under comparison, but the byte[] are no longer the same.
This hypothesis is kind of supported by the fact that the catch block for CharacterCodingException in StringCoding.decode() says:
} catch (CharacterCodingException x) {
// Substitution is always enabled,
// so this shouldn't happen
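If substitution is indeed what happens, it can be observed directly. A minimal sketch (the single byte 0xE9 is an arbitrary example: valid in ISO-8859-1, but an invalid stand-alone byte in UTF-8):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementDemo {
    public static void main(String[] args) {
        byte[] original = {(byte) 0xE9}; // not a valid UTF-8 sequence on its own

        // new String(...) silently substitutes U+FFFD for the bad byte
        String decoded = new String(original, StandardCharsets.UTF_8);
        System.out.println(decoded.charAt(0) == '\uFFFD'); // true

        // Re-encoding produces the UTF-8 bytes of U+FFFD (EF BF BD),
        // not the original byte, so the arrays differ
        byte[] reEncoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, reEncoded)); // false
        System.out.println(reEncoded.length); // 3
    }
}
```

This matches the symptom in the question: two strings that both contain the same substitutions compare equal, while the underlying byte arrays do not.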
Upvotes: 2
Reputation: 4015
The real problem in your code is that you don't know what the file's real encoding is. When you read the string from the web service you get a sequence of chars; when you convert that string from chars to bytes, the conversion is done correctly because you specify how to transform chars into bytes with a specific encoding ("UTF-8"). When you read a text file you face a different problem: you have a sequence of bytes that needs to be converted to chars. To do that properly you must know how the chars were converted to bytes, i.e. what the file encoding is. For files (unless specified) it's a platform constant: on Windows, files are encoded in windows-1252 (which is very close to ISO-8859-1); on Linux/Unix it depends, but I think UTF-8 is the default.
By the way, the web service call did a second decoding operation under the hood: the HTTP call uses a header that defines how chars are encoded, i.e. how to read the bytes from the socket and transform them into chars. So calling a SOAP web service gives you back XML (which can be unmarshalled into a Java object) with all the encoding operations done properly.
So if you must read chars from a file, you must face the encoding issue. You can use Base64 as you stated, but you lose one of the main benefits of text files: they are human readable, which eases debugging and development.
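As a sketch of handling the encoding explicitly (the temp file and the word "café" here are made-up illustrations, not part of the original question):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWithCharset {
    public static void main(String[] args) throws Exception {
        // Hypothetical set-up: a file written with a Windows encoding
        Path path = Files.createTempFile("demo", ".txt");
        Files.write(path, "caf\u00E9".getBytes(Charset.forName("windows-1252")));

        byte[] bytes = Files.readAllBytes(path);

        // Decoding with the wrong charset mangles the accented char...
        String wrong = new String(bytes, StandardCharsets.UTF_8);
        // ...while decoding with the encoding the file was written in works
        String right = new String(bytes, Charset.forName("windows-1252"));
        System.out.println(wrong.equals("caf\u00E9"));  // false
        System.out.println(right.equals("caf\u00E9"));  // true
    }
}
```

The point is that the byte-to-char step always involves some charset; if you don't pick one, the platform default is picked for you.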
Upvotes: 0
Reputation: 31605
I don't understand it fully, but here's what I get so far:
The problem is that the data contains some bytes which are not valid UTF-8 bytes, as I know from the following check:
// returns false for my data!
public static boolean isValidUTF8(byte[] input) {
    CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();
    try {
        cs.decode(ByteBuffer.wrap(input));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}
When I change the encoding to ISO-8859-1, everything works fine. The strange thing (which I don't understand yet) is why my conversion (new String(bytesFromFile, "UTF-8")) doesn't throw any exception (like my isValidUTF8 method does), although the data is not valid UTF-8.
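A plausible explanation for the difference, as a minimal sketch: a freshly created CharsetDecoder (as in isValidUTF8 above) defaults to reporting errors by throwing, while the String constructor path replaces bad input instead. The two-byte sequence below is an arbitrary malformed example:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeBehaviour {
    public static void main(String[] args) {
        byte[] invalid = {(byte) 0xC3, (byte) 0x28}; // malformed UTF-8 pair

        // new String never throws: bad input becomes U+FFFD
        String lenient = new String(invalid, StandardCharsets.UTF_8);
        System.out.println(lenient.contains("\uFFFD")); // true

        // A fresh CharsetDecoder defaults to CodingErrorAction.REPORT,
        // so the same bytes raise a CharacterCodingException
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder();
        try {
            strict.decode(ByteBuffer.wrap(invalid));
            System.out.println("decoded");
        } catch (CharacterCodingException e) {
            System.out.println("threw " + e.getClass().getSimpleName());
        }
    }
}
```

So the isValidUTF8 check and the String constructor see the same bytes but are configured with different error actions.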
However, I think I will go another way and encode my byte[] in a Base64 string, as I don't want more trouble with encoding.
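A minimal Base64 round-trip sketch, assuming java.util.Base64 (available since Java 8); the byte values are arbitrary:

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        byte[] raw = {(byte) 0xE9, 0x00, (byte) 0xFF, 0x42};

        // Encode: every byte maps to an ASCII character,
        // so there is no charset ambiguity in transit
        String transported = Base64.getEncoder().encodeToString(raw);

        // Decode on the other side: the bytes survive unchanged
        byte[] restored = Base64.getDecoder().decode(transported);
        System.out.println(Arrays.equals(raw, restored)); // true
    }
}
```

Since the encoded form is pure ASCII, it survives any byte-to-char conversion unchanged, which sidesteps the whole encoding question.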
Upvotes: 1