AES encryption issue for text consisting of Chinese characters

Question

I have an issue with Chinese strings read from a MySQL database. This database has the default setup:

character_set_database: latin1
information_schema: utf8
collation: utf8_general_ci

For the schema I am working with:

charset: latin1
collation: latin1_swedish_ci

I have imported this database using an SQL dump.

The tables contain both Latin data and Chinese data. This is a worldwide database.

I can read all of them in Java.

My issue arises when I want to encrypt the data. I am using AES with the Java crypto API and return the bytes as a string using Base64 encoding.

Encryption runs fine. My issue is that when I encrypt the Chinese characters, the encrypted string I get back is far too big (like 300 chars), although the Chinese text is only a few characters long.

The encryption code is like this

Cipher cipher = Cipher.getInstance("AES");
cipher.init(Cipher.ENCRYPT_MODE, aesKey);
byte[] encrypted = cipher.doFinal(value.getBytes("UTF-8"));

String encoded = Base64.encodeBase64String(encrypted);
return new String(encoded.getBytes("UTF-8"));

Do you have any idea why the encrypted value is so long? Should I handle the Chinese values differently before encrypting them?

Addendum:

When I debug: If I encrypt this: 桃草夹芥人蕉芥玉芥花荷子衣兰芥花

I get the resulting string: String value = ENCR({FDDabCcaDabp6YSLYCzg/1MuSzt8QPGEEk3ymeAOW5vERBk+oN3bMSUV5bEbocifr216yqUCObrqDjrrhVwGDqzafWVbELpTQ==}_AB_DCD_)

When I call value.length I get 115. And 115 is just too long for my DB.

I think the Chinese characters are more than two bytes long. Is that a correct assumption?

Do you see the reason why I get length = 115?

Thanks

=================================== ADDENDUM 2

The code is:

    try {
        String english = "Rastapopoulos";
        String chinese = "桃草夹芥人蕉芥玉芥花荷子衣兰芥花";
        String transformationKey = "asdewqayxswedcvf";
        Key aesKey = new SecretKeySpec(transformationKey.getBytes("UTF-8"), "AES");
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, aesKey);

        byte[] encrypted1 = cipher.doFinal(english.getBytes("UTF-8"));
        String encoded1 = Base64.encodeBase64String(encrypted1);

        byte[] encrypted2 = cipher.doFinal(chinese.getBytes("UTF-8"));
        String encoded2 = Base64.encodeBase64String(encrypted2);

        System.out.println("Original length: " + english.length() + "	Encrypted length: " + encoded1.length() + "	" + encoded1);
        System.out.println("Original length: " + chinese.length() + "	Encrypted length: " + encoded2.length() + "	" + encoded2);
    } catch (Exception e) {
        e.printStackTrace();
    }

And gives me the following output

Original length: 13 Encrypted length: 24 V4y9u3tNQaH81BAcqi1XZg==
Original length: 16 Encrypted length: 88 KTMAxhqALAlXfjaOLsBlbj7jbqz+8M4F0AlvvUU5OmrvT+D7MGQHseYKm32V46bqyNbHtu91JC4sQ+mVoWp/wQ==

Which is similar to what you got.

My issue is that I can't write it back to the DB because it is larger than the max length of the field. But what I don't understand is why my English strings of 13-15 characters always give me an encoded length of 24, and why my 16 Chinese characters give me an 88-character encrypted value.

Where does this difference come from?

The values in the DB are pretty small, less than 20 chars, so I should not have any issue encrypting them. The result will always be less than 24 chars long. So why is it different for Chinese characters?

Thanks

Maarten Bodewes · Accepted Answer

UTF-8 is not the most compact encoding for Chinese characters: most of them take three bytes each in UTF-8, versus two bytes in UTF-16.

Furthermore, CBC mode + PKCS#7 padding (called PKCS5Padding in Java) is not the most efficient mode either as it requires a large, random IV as well as padding.
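The 24 and 88 in the question can be reproduced with plain arithmetic. The sketch below is an illustration rather than part of the original answer, and it assumes that `Cipher.getInstance("AES")` resolves to a padded block mode with no stored IV (on the default OpenJDK provider this is AES/ECB/PKCS5Padding). The class and method names are made up for the example, and the sample string is written with Unicode escapes so the source compiles under any file encoding.

```java
import java.nio.charset.StandardCharsets;

class CiphertextLength {
    static final int BLOCK = 16; // AES block size in bytes

    // PKCS#7 padding always adds 1..16 bytes, so an input that is already
    // a multiple of the block size still gains one full block.
    static int paddedLength(int plaintextBytes) {
        return (plaintextBytes / BLOCK + 1) * BLOCK;
    }

    // Padded Base64 emits 4 characters for every started group of 3 bytes.
    static int base64Length(int rawBytes) {
        return (rawBytes + 2) / 3 * 4;
    }

    // Expected length of the Base64 text for a UTF-8 encoded plaintext.
    static int encodedLength(String text) {
        int utf8Bytes = text.getBytes(StandardCharsets.UTF_8).length;
        return base64Length(paddedLength(utf8Bytes));
    }

    public static void main(String[] args) {
        // The 16-character Chinese sample string from the question.
        String chinese = "\u6843\u8349\u5939\u82a5\u4eba\u8549\u82a5\u7389"
                + "\u82a5\u82b1\u8377\u5b50\u8863\u5170\u82a5\u82b1";

        // 13 UTF-8 bytes -> 16 padded bytes -> 24 Base64 characters
        System.out.println(encodedLength("Rastapopoulos")); // prints 24
        // 48 UTF-8 bytes -> 64 padded bytes -> 88 Base64 characters
        System.out.println(encodedLength(chinese));         // prints 88
    }
}
```

Any plaintext of up to 15 UTF-8 bytes fits in a single block, which is why the short English values always come out at 24 characters.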

So to get a smaller encoded value, try UTF-16 and CTR mode, where the IV consists of just an 8-byte nonce (included with the ciphertext) and no padding is needed.

Example code:

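// Hex and Arrays are assumed to be the Bouncy Castle utility classes.
import static java.nio.charset.StandardCharsets.UTF_16BE;
import static java.nio.charset.StandardCharsets.UTF_8;

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import org.bouncycastle.util.Arrays;
import org.bouncycastle.util.encoders.Hex;

// The statements below belong inside a method that declares "throws Exception".
SecureRandom rng = new SecureRandom();
// All-zero demo key; use a randomly generated key in practice.
SecretKey aesKey = new SecretKeySpec(new byte[16], "AES");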

String chinese = "桃草夹芥人蕉芥玉芥花荷子衣兰芥花";
byte[] utf8Chinese = chinese.getBytes(UTF_8);
System.out.printf("UTF-8    encoded : %d bytes: %s%n", utf8Chinese.length, Hex.toHexString(utf8Chinese));

{
    Cipher aesCBC = Cipher.getInstance("AES/CBC/PKCS5Padding");

    byte[] ivBytes = new byte[aesCBC.getBlockSize()];
    rng.nextBytes(ivBytes);
    aesCBC.init(Cipher.ENCRYPT_MODE, aesKey, new IvParameterSpec(ivBytes));

    byte[] cipherTextCBC = aesCBC.doFinal(utf8Chinese);
    byte[] ivAndCipherTextCBC = Arrays.concatenate(ivBytes, cipherTextCBC);

    System.out.printf("UTF-8, CBC    encoded : %d bytes: %s%n", ivAndCipherTextCBC.length, Hex.toHexString(ivAndCipherTextCBC));
}

byte[] utf16Chinese = chinese.getBytes(UTF_16BE);
System.out.printf("UTF-16BE encoded : %d bytes: %s%n", utf16Chinese.length, Hex.toHexString(utf16Chinese));

{
    Cipher aesCTR = Cipher.getInstance("AES/CTR/NoPadding");

    byte[] nonce = new byte[8];
    rng.nextBytes(nonce);
    byte[] initialCounterValue = new byte[8];
    byte[] ivForCTR = Arrays.concatenate(nonce, initialCounterValue);
    aesCTR.init(Cipher.ENCRYPT_MODE, aesKey, new IvParameterSpec(ivForCTR));

    byte[] cipherTextCTR = aesCTR.doFinal(utf16Chinese);
    byte[] ivAndCipherTextCTR = Arrays.concatenate(ivForCTR, cipherTextCTR);

    System.out.printf("UTF-16BE, CTR encoded : %d bytes: %s%n", ivAndCipherTextCTR.length, Hex.toHexString(ivAndCipherTextCTR));
}

And finally the output:

UTF-8    encoded : 48 bytes: e6a183e88d89e5a4b9e88aa5e4babae89589e88aa5e78e89e88aa5e88ab1e88db7e5ad90e8a1a3e585b0e88aa5e88ab1
UTF-8, CBC    encoded : 80 bytes: c109837322fcd5472539bb7cb51dd6841cea744273979cdbed54d9db019747d41b4e784c22f8e6384e92135ff37747797796baa438f26c914dc5ab99b17afc30771e0b18263d2061d971ef54c457c1b9
UTF-16BE encoded : 32 bytes: 68438349593982a54eba854982a5738982a582b183775b508863517082a582b1
UTF-16BE, CTR encoded : 48 bytes: 9c6afe2d8899284f0000000000000000cad3877bee435324ffa671f956781f2838279fe56e811c9ba5bcf98a6cc98a7f

And there you have it: 32 fewer bytes. And that's before Base64 encoding, which expands the ciphertext by another third, at least when the result is put into a column that uses an ASCII-compatible encoding such as UTF-8. Note that you don't want to use UTF-16 for the Base64-encoded result after encryption (just storing the binary, without encoding to Base64, is of course best).
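To put numbers on the Base64 overhead, here is a small sketch that encodes dummy payloads of the two sizes printed above with the JDK's own `java.util.Base64`. The class and method names are hypothetical; only the payload sizes come from the output.

```java
import java.util.Base64;

class Base64Overhead {
    // Number of characters in the padded Base64 text for a payload of the given size.
    static int encodedChars(int rawBytes) {
        return Base64.getEncoder().encodeToString(new byte[rawBytes]).length();
    }

    public static void main(String[] args) {
        System.out.println(encodedChars(80)); // UTF-8 + CBC payload: prints 108
        System.out.println(encodedChars(48)); // UTF-16BE + CTR payload: prints 64
    }
}
```

So the text column needs room for 64 characters with the UTF-16BE and CTR combination, while a binary column would need only 48 bytes.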

Notes:

the IV for CBC and the nonce for CTR mode are required; without them the encryption does not offer full confidentiality (and approximately no confidentiality for CTR);
don't encrypt more than 2^16 plaintexts with the same key in CTR mode using the above scheme.
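
Reading a value back is the same operation in reverse. The following is a minimal round-trip sketch using only JDK classes, assuming the layout produced by the example code (the 16-byte IV followed by the ciphertext). The class name, method names and the all-zero demo key are illustrative and not part of the original example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

class CtrRoundTrip {
    private static final int NONCE_BYTES = 8;
    private static final int IV_BYTES = 16;
    private static final SecureRandom RNG = new SecureRandom();

    // Returns IV || ciphertext, where IV = 8 random nonce bytes + 8 zero counter bytes.
    static byte[] encrypt(SecretKey key, String text) {
        try {
            byte[] nonce = new byte[NONCE_BYTES];
            RNG.nextBytes(nonce);
            byte[] iv = Arrays.copyOf(nonce, IV_BYTES); // counter half starts at zero

            Cipher ctr = Cipher.getInstance("AES/CTR/NoPadding");
            ctr.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] cipherText = ctr.doFinal(text.getBytes(StandardCharsets.UTF_16BE));

            byte[] out = Arrays.copyOf(iv, IV_BYTES + cipherText.length);
            System.arraycopy(cipherText, 0, out, IV_BYTES, cipherText.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES/CTR encryption failed", e);
        }
    }

    // Splits the IV off the front and decrypts the remainder.
    static String decrypt(SecretKey key, byte[] ivAndCipherText) {
        try {
            Cipher ctr = Cipher.getInstance("AES/CTR/NoPadding");
            ctr.init(Cipher.DECRYPT_MODE, key,
                    new IvParameterSpec(ivAndCipherText, 0, IV_BYTES));
            byte[] plain = ctr.doFinal(ivAndCipherText, IV_BYTES,
                    ivAndCipherText.length - IV_BYTES);
            return new String(plain, StandardCharsets.UTF_16BE);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES/CTR decryption failed", e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = new SecretKeySpec(new byte[16], "AES"); // demo key only
        String original = "\u6843\u8349\u5939\u82a5"; // first four sample characters
        byte[] stored = encrypt(key, original);
        System.out.println(stored.length);                         // 16 IV bytes + 8 ciphertext bytes
        System.out.println(original.equals(decrypt(key, stored))); // prints true
    }
}
```

Note that CTR mode on its own does not detect tampering with the stored value; if that matters, an authenticated mode such as GCM is the usual choice, at the cost of an extra authentication tag per value.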
