Thomas Uhrig

Reputation: 31605

Converting string to byte[] returns wrong value (encoding?)

I read a byte[] from a file and convert it to a String:

byte[] bytesFromFile = Files.readAllBytes(...);
String stringFromFile = new String(bytesFromFile, "UTF-8");

I want to compare this to another byte[] I get from a web service:

String stringFromWebService = webService.getMyByteString(); 
byte[] bytesFromWebService = stringFromWebService.getBytes("UTF-8");

So I read a byte[] from a file and convert it to a String and I get a String from my web service and convert it to a byte[]. Then I do the following tests:

// works!
org.junit.Assert.assertEquals(stringFromFile, stringFromWebService);

// fails!
org.junit.Assert.assertArrayEquals(bytesFromFile, bytesFromWebService);

Why does the second assertion fail?

Upvotes: 7

Views: 2087

Answers (3)

J Richard Snape

Reputation: 20344

Other answers have covered the likely fact that the file is not UTF-8 encoded giving rise to the symptoms described.

However, I think the most interesting aspect of this is not that the byte[] assert fails, but that the assert that the string values are the same passes. I'm not 100% sure why this is, but I think the following trawl through the source code might give us the answer:

  • Looking at how new String(bytesFromFile, "UTF-8"); works - we see that the constructor calls through to StringCoding.decode()
  • This in turn, if supplied with the UTF-8 character set, calls through to StringDecoder.decode()
  • This calls through to CharsetDecoder.decode() which decides what to do if the character is unmappable (which I guess will be the case if a non-UTF-8 character is presented)
  • In this case it uses an action defined by

    private CodingErrorAction unmappableCharacterAction
        = CodingErrorAction.REPORT;
    
  • Which means that it still reports the character it has decoded, even though it's technically unmappable.

  • I think this means that even when the code gets an unmappable character, it substitutes its best guess - so I'm guessing that its best guess is correct and hence the String representations are the same under comparison, but the byte[] are no longer the same.

This hypothesis is kind of supported by the fact that the catch block for CharacterCodingException in StringCoding.decode() says:

} catch (CharacterCodingException x) {
            // Substitution is always enabled,
            // so this shouldn't happen

Upvotes: 2

Giovanni

Reputation: 4015

The real problem in your code is that you don't know what the real file encoding is. When you read the string from the web service you get a sequence of chars; when you convert that string from chars to bytes, the conversion is done correctly because you specify how to transform chars into bytes with a specific encoding ("UTF-8"). When you read a text file you face a different problem: you have a sequence of bytes that needs to be converted to chars. To do that properly you must know how the chars were converted to bytes, i.e. what the file encoding is. For files (unless specified) it's a platform constant; on Windows, files are encoded in windows-1252 (which is very close to ISO-8859-1); on Linux/Unix it depends, though UTF-8 is usually the default.
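For example (a hypothetical single byte, just to show how the result depends on the assumed encoding): the same byte sequence decodes to different strings under different charsets:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // the single byte 0xE9 is 'é' in ISO-8859-1, but on its own it is invalid UTF-8
        byte[] fileBytes = { (byte) 0xE9 };

        String asLatin1 = new String(fileBytes, StandardCharsets.ISO_8859_1);
        String asUtf8   = new String(fileBytes, StandardCharsets.UTF_8);

        System.out.println(asLatin1); // é
        System.out.println(asUtf8);   // the U+FFFD replacement character
    }
}
```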

By the way, the web service call did a second decoding operation under the hood: the HTTP call uses a header that defines how chars are encoded, i.e. how to read the bytes from the socket and transform them into chars. So calling a SOAP web service gives you back XML (which can be unmarshalled into a Java object) with all the encoding operations done properly.

So if you must read chars from a file, you must face the encoding issue. You can use Base64 as you stated, but you lose one of the main benefits of text files: they are human readable, which eases debugging and development.

Upvotes: 0

Thomas Uhrig

Reputation: 31605

I don't understand it fully, but here's what I get so far:

The problem is that the data contains some bytes which are not valid UTF-8, as I know from the following check:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

// returns false for my data!
public static boolean isValidUTF8(byte[] input) {
    CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();
    try {
        cs.decode(ByteBuffer.wrap(input));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}

When I change the encoding to ISO-8859-1, everything works fine. The strange thing (which I don't understand yet) is why my conversion (new String(bytesFromFile, "UTF-8");) doesn't throw any exception (like my isValidUTF8 method does), although the data is not valid UTF-8.
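The difference can be shown directly (a sketch based on the JDK documentation): `new String(bytes, charset)` always replaces malformed input with U+FFFD, while a `CharsetDecoder` obtained via `newDecoder()` defaults to `CodingErrorAction.REPORT` and therefore throws:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class DecodeBehaviour {
    public static void main(String[] args) throws Exception {
        byte[] invalid = { (byte) 0xFF }; // 0xFF can never appear in valid UTF-8

        // the String constructor substitutes U+FFFD and never throws
        String s = new String(invalid, StandardCharsets.UTF_8);
        System.out.println(s.equals("\uFFFD")); // true

        // an explicit decoder defaults to REPORT and throws instead
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(invalid));
            System.out.println("no exception");
        } catch (CharacterCodingException e) {
            System.out.println("CharacterCodingException"); // this branch runs
        }
    }
}
```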

However, I think I will go another way and encode my byte[] as a Base64 string, as I don't want more trouble with encoding.
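With java.util.Base64 (available since Java 8) the transport string is pure ASCII, so no charset guessing is needed; a minimal sketch:

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        byte[] original = { (byte) 0xC3, (byte) 0x28, (byte) 0xFF }; // arbitrary non-UTF-8 bytes

        // encode to an ASCII-safe string, e.g. to send through the web service
        String transport = Base64.getEncoder().encodeToString(original);

        // decode back -- the bytes survive the round trip exactly
        byte[] restored = Base64.getDecoder().decode(transport);
        System.out.println(Arrays.equals(original, restored)); // true
    }
}
```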

Upvotes: 1
