user5654873

Why does UTF8 encoding change/corrupt bytes, as opposed to Base64 and ASCII, when writing to a file?

I am writing an application that receives an encrypted byte array consisting of a file name and file bytes, with the following protocol: file_name_and_extension|bytes. The byte array is then decrypted, and passing it into Encoding.UTF8.GetString(decrypted_bytes) would be preferable, because I would like to trim file_name_and_extension from the received bytes and save the actual file bytes to a file named file_name_and_extension.

I simplified my application so that it only receives file bytes, which are passed into Encoding.UTF8.GetString() and back into a byte array with Encoding.UTF8.GetBytes(). After that, I try to write a zip file, but the file is invalid. It works when using ASCII or Base64.

private void Decryption(byte[] encryptedMessage, byte[] iv)
{
    using (Aes aes = new AesCryptoServiceProvider())
    {
        aes.Key = receiversKey;
        aes.IV = iv;
        // Decrypt the message
        using (MemoryStream decryptedBytes = new MemoryStream())
        {
            using (CryptoStream cs = new CryptoStream(decryptedBytes, aes.CreateDecryptor(), CryptoStreamMode.Write))
            {
                cs.Write(encryptedMessage, 0, encryptedMessage.Length);
                cs.Close();

                string decryptedBytesString = Encoding.UTF8.GetString(decryptedBytes.ToArray()); //corrupts the zip
                //string decryptedBytesString = Encoding.ASCII.GetString(decryptedBytes.ToArray()); //works
                //String decryptedBytesString = Convert.ToBase64String(decryptedBytes.ToArray()); //works

                byte[] fileBytes = Encoding.UTF8.GetBytes(decryptedBytesString);
                //byte[] fileBytes = Encoding.ASCII.GetBytes(decryptedBytesString);
                //byte[] fileBytes = Convert.FromBase64String(decryptedBytesString);
                File.WriteAllBytes("RECEIVED\\received.zip", fileBytes);
            }
        }
    }
}

Upvotes: 3

Views: 3099

Answers (1)

Eugene Podskal

Reputation: 10401

Because you shouldn't try to interpret raw bytes as characters in some encoding unless you actually know, or can deduce, the encoding used.

If you receive some nonspecific raw bytes, then process them as raw bytes.

But why does it work/not work?

Because:

  1. Encoding.ASCII - despite appearing to work here, it is not round-trip safe either: the .NET ASCII encoding substitutes '?' (0x3F) for any byte above 127 when decoding, so arbitrary binary data can still be altered by the round trip.
  2. Base64 is designed precisely for representing arbitrary bytes as text, so encoding and then decoding it cannot change the original data.
  3. UTF8 - the decrypted bytes are almost certainly not a valid UTF-8 sequence. Encoding.UTF8.GetString does not throw on invalid input by default; it silently substitutes the Unicode replacement character U+FFFD for every invalid sequence, and Encoding.UTF8.GetBytes then encodes each replacement character as the three bytes EF BF BD. The original bytes are irretrievably lost, as the demo after this list shows.
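
To make the loss concrete, here is a minimal standalone sketch (not from the original post) that round-trips a few bytes that are invalid UTF-8 and prints both arrays:

using System;
using System.Text;

class Utf8RoundTripDemo
{
    static void Main()
    {
        // 0xC3 opens a two-byte UTF-8 sequence, but 0x28 is not a valid
        // continuation byte; 0xFF can never appear in UTF-8 at all.
        byte[] original = { 0x50, 0x4B, 0xC3, 0x28, 0xFF };

        string text = Encoding.UTF8.GetString(original);    // invalid sequences -> U+FFFD
        byte[] roundTripped = Encoding.UTF8.GetBytes(text); // U+FFFD -> EF BF BD

        Console.WriteLine(BitConverter.ToString(original));     // 50-4B-C3-28-FF
        Console.WriteLine(BitConverter.ToString(roundTripped)); // 50-4B-EF-BF-BD-28-EF-BF-BD
    }
}

Five bytes in, nine bytes out - which is exactly the kind of change that destroys a zip archive.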

In any case, I repeat: do not encode/decode anything as text unless it actually is string data (or a format, like Base64, that is designed for binary).
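
Applied to the original file_name_and_extension|bytes protocol, that means locating the separator in the raw decrypted bytes and decoding only the file-name prefix as text. A rough sketch of that idea (method and variable names are mine, not from the question; it assumes the file name itself cannot contain '|'):

using System;
using System.IO;
using System.Text;

static class ReceivedFileWriter
{
    public static void Save(byte[] decrypted)
    {
        // '|' is a single byte (0x7C) in UTF-8, and UTF-8 continuation
        // bytes are always >= 0x80, so scanning the raw bytes is safe.
        int separator = Array.IndexOf(decrypted, (byte)'|');
        if (separator < 0)
            throw new InvalidDataException("Separator '|' not found.");

        // Only the file name is text, so only the file name is decoded.
        string fileName = Encoding.UTF8.GetString(decrypted, 0, separator);

        // The payload stays raw bytes all the way from decryption to disk.
        byte[] fileBytes = new byte[decrypted.Length - separator - 1];
        Buffer.BlockCopy(decrypted, separator + 1, fileBytes, 0, fileBytes.Length);

        File.WriteAllBytes(Path.Combine("RECEIVED", fileName), fileBytes);
    }
}

Inside the question's Decryption method this would replace everything after cs.Close() with a single ReceivedFileWriter.Save(decryptedBytes.ToArray()) call.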

Upvotes: 5
