Reputation: 369
I have an issue with Chinese strings retrieved from a MySQL database. This database has a default setup:
For the schema I am working with:
I have imported this database using an SQL dump.
The tables contain both Latin data and Chinese data. This is a worldwide database.
I can read all of them in Java.
My issue arises when I want to encrypt the data. I am using AES with Java crypto, and I return the bytes as a string using Base64.encode
Encryption runs fine. My issue is that when I encrypt the Chinese characters, the encrypted string I get back is far too big (around 300 chars), although the Chinese text is only a few characters long.
The encryption code is like this
Cipher cipher = Cipher.getInstance("AES");
cipher.init(Cipher.ENCRYPT_MODE, aesKey);
byte[] encrypted = cipher.doFinal(value.getBytes("UTF-8"));
String encoded = Base64.encodeBase64String(encrypted);
return new String(encoded.getBytes("UTF-8"));
Do you have any idea why the encrypted value is so long? Should I handle the Chinese values differently before encrypting them?
Addendum:
When I debug: If I encrypt this: 桃草夹芥人蕉芥玉芥花荷子衣兰芥花
I get the result String String value = ENCR({FDDabCcaDabp6YSLYCzg/1MuSzt8QPGEEk3ymeAOW5vERBk+oN3bMSUV5bEbocifr216yqUCObrqDjrrhVwGDqzafWVbELpTQ==}_AB_DCD_)
When I call value.length I get 115. And 115 is just too long for my DB.
I think the Chinese characters are more than two bytes long. Is that a correct assumption?
Do you see the reason why I get length = 115?
Thanks
=================================== ADDENDUM 2
The code is:
try {
String english = "Rastapopoulos";
String chinese = "桃草夹芥人蕉芥玉芥花荷子衣兰芥花";
String transformationKey = "asdewqayxswedcvf";
Key aesKey = new SecretKeySpec(transformationKey.getBytes("UTF-8"), "AES");
Cipher cipher = Cipher.getInstance("AES");
cipher.init(Cipher.ENCRYPT_MODE, aesKey);
byte[] encrypted1 = cipher.doFinal(english.getBytes("UTF-8"));
String encoded1 = Base64.encodeBase64String(encrypted1);
byte[] encrypted2 = cipher.doFinal(chinese.getBytes("UTF-8"));
String encoded2 = Base64.encodeBase64String(encrypted2);
System.out.println("Original length: " + english.length() + "\tEncrypted length: " + encoded1.length() + "\t" + encoded1);
System.out.println("Original length: " + chinese.length() + "\tEncrypted length: " + encoded2.length() + "\t" + encoded2);
} catch (Exception e) {
e.printStackTrace();
}
And gives me the following output
Original length: 13    Encrypted length: 24    V4y9u3tNQaH81BAcqi1XZg==
Original length: 16    Encrypted length: 88    KTMAxhqALAlXfjaOLsBlbj7jbqz+8M4F0AlvvUU5OmrvT+D7MGQHseYKm32V46bqyNbHtu91JC4sQ+mVoWp/wQ==
This is similar to what you got.
My issue is that I can't write it back to the DB because it is larger than the max length of the field. But what I don't understand is why my English strings of 13-15 characters give me an encoded length of 24 every time, while my 16 Chinese characters give me an 88-character encrypted value.
Where does this difference come from?
The values in the DB are pretty small, less than 20 chars, so I should not have any issue encrypting them. I expected the result to always be no more than 24 chars long. So why is it different for Chinese characters?
Thanks
Upvotes: 3
Views: 3398
Reputation: 142453
In MySQL, use CHARACTER SET utf8mb4 (not latin1, not utf8) on any columns that will have Chinese in them. That corresponds to UTF-8 outside MySQL.
Do not use UTF16 for anything unless that happens to be the encoding of some source text.
SELECT length(aes_encrypt("桃草夹芥人蕉芥玉芥花荷子衣兰芥花", 'AES')) --> 64; I don't know where you are getting 24. Furthermore, the output from aes_encrypt is always a multiple of 16 bytes.
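The 24 and 88 in the question are Base64 character counts, not raw byte counts, and they can be reproduced from the block-size rule. The following is a minimal sketch, assuming that `Cipher.getInstance("AES")` resolves to `AES/ECB/PKCS5Padding` (the usual default with the standard SunJCE provider) and using `java.util.Base64` instead of the Commons Codec class from the question. The class and method names are made up for illustration, and the key is the demo key from the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

public class AesLengthDemo {
    // Demo key copied from the question; never hard-code a real key like this.
    private static final byte[] KEY = "asdewqayxswedcvf".getBytes(StandardCharsets.UTF_8);

    /** Predicted Base64 length for padded AES output with no IV stored. */
    static int predictedBase64Length(int plainBytes) {
        int padded = (plainBytes / 16 + 1) * 16; // PKCS padding always adds 1..16 bytes
        return 4 * ((padded + 2) / 3);           // Base64 emits 4 chars per 3 bytes, rounded up
    }

    /** Encrypts the UTF-8 bytes of the text and returns the Base64 form. */
    static String encryptToBase64(String text) throws Exception {
        // Assumption: this is what plain "AES" means with the default provider.
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(KEY, "AES"));
        byte[] encrypted = cipher.doFinal(text.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(encrypted);
    }

    public static void main(String[] args) throws Exception {
        for (String s : new String[] {"Rastapopoulos", "桃草夹芥人蕉芥玉芥花荷子衣兰芥花"}) {
            int utf8Bytes = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(s.length() + " chars, " + utf8Bytes + " UTF-8 bytes, predicted "
                    + predictedBase64Length(utf8Bytes) + ", actual " + encryptToBase64(s).length());
        }
    }
}
```

The English string is 13 UTF-8 bytes, which pads to 16 bytes and encodes to 24 Base64 characters. The Chinese string is 16 characters but 48 UTF-8 bytes, which pads to 64 bytes and encodes to 88 Base64 characters.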
If you are going to store the encrypted value in MySQL, you must do one of these:
Use a VARBINARY(...) or BLOB column, or
use a VARCHAR/TEXT column, but store the HEX/BASE64 encoding of the aes_encrypt output.
Upvotes: 2
Reputation: 94038
UTF-8 is not the most compact encoding for Chinese characters, as most of them take three bytes each in UTF-8 versus two bytes in UTF-16.
Furthermore, CBC mode + PKCS#7 padding (called PKCS5Padding in Java) is not the most efficient mode either, as it requires a large, random IV as well as padding.
So to get a smaller encoded value, try UTF-16 and CTR mode, where the IV consists of just an 8-byte nonce (included with the ciphertext) and there is no padding.
Example code:
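// Hex and Arrays.concatenate come from Bouncy Castle (org.bouncycastle.util);
// UTF_8 and UTF_16BE are static imports from java.nio.charset.StandardCharsets.
SecureRandom rng = new SecureRandom();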
SecretKey aesKey = new SecretKeySpec(new byte[16], "AES");
String chinese = "桃草夹芥人蕉芥玉芥花荷子衣兰芥花";
byte[] utf8Chinese = chinese.getBytes(UTF_8);
System.out.printf("UTF-8 encoded : %d bytes: %s%n", utf8Chinese.length, Hex.toHexString(utf8Chinese));
{
Cipher aesCBC = Cipher.getInstance("AES/CBC/PKCS5Padding");
byte[] ivBytes = new byte[aesCBC.getBlockSize()];
rng.nextBytes(ivBytes);
aesCBC.init(Cipher.ENCRYPT_MODE, aesKey, new IvParameterSpec(ivBytes));
byte[] cipherTextCBC = aesCBC.doFinal(utf8Chinese);
byte[] ivAndCipherTextCBC = Arrays.concatenate(ivBytes, cipherTextCBC);
System.out.printf("UTF-8, CBC encoded : %d bytes: %s%n", ivAndCipherTextCBC.length, Hex.toHexString(ivAndCipherTextCBC));
}
byte[] utf16Chinese = chinese.getBytes(UTF_16BE);
System.out.printf("UTF-16BE encoded : %d bytes: %s%n", utf16Chinese.length, Hex.toHexString(utf16Chinese));
{
Cipher aesCTR = Cipher.getInstance("AES/CTR/NoPadding");
byte[] nonce = new byte[8];
rng.nextBytes(nonce);
byte[] initialCounterValue = new byte[8];
byte[] ivForCTR = Arrays.concatenate(nonce, initialCounterValue);
aesCTR.init(Cipher.ENCRYPT_MODE, aesKey, new IvParameterSpec(ivForCTR));
byte[] cipherTextCTR = aesCTR.doFinal(utf16Chinese);
byte[] ivAndCipherTextCTR = Arrays.concatenate(ivForCTR, cipherTextCTR);
System.out.printf("UTF-16BE, CTR encoded : %d bytes: %s%n", ivAndCipherTextCTR.length, Hex.toHexString(ivAndCipherTextCTR));
}
And finally the output:
UTF-8 encoded : 48 bytes: e6a183e88d89e5a4b9e88aa5e4babae89589e88aa5e78e89e88aa5e88ab1e88db7e5ad90e8a1a3e585b0e88aa5e88ab1
UTF-8, CBC encoded : 80 bytes: c109837322fcd5472539bb7cb51dd6841cea744273979cdbed54d9db019747d41b4e784c22f8e6384e92135ff37747797796baa438f26c914dc5ab99b17afc30771e0b18263d2061d971ef54c457c1b9
UTF-16BE encoded : 32 bytes: 68438349593982a54eba854982a5738982a582b183775b508863517082a582b1
UTF-16BE, CTR encoded : 48 bytes: 9c6afe2d8899284f0000000000000000cad3877bee435324ffa671f956781f2838279fe56e811c9ba5bcf98a6cc98a7f
And there you have it: 32 fewer bytes (48 instead of 80). And that's before Base64 encoding, which expands the ciphertext by another third, at least when the result is put into a column that uses an ASCII-compatible encoding such as UTF-8. Note that you don't want to use UTF-16 for the Base64-encoded result after encryption (just storing binary, without encoding to Base64, is of course best).
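The 48-byte CTR result above still carries the eight zero bytes of the initial counter. Since those are always zero, only the nonce has to be stored. Here is a minimal sketch of that variation using only the JDK; the class name, method names, and storage layout (nonce followed by ciphertext) are assumptions for illustration, and the key is the demo key from the question. Note that CTR mode gives no integrity protection and that a nonce must never repeat under the same key.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CompactCtr {
    private static final int NONCE_LEN = 8;
    // Demo key copied from the question; never hard-code a real key like this.
    private static final byte[] KEY = "asdewqayxswedcvf".getBytes(StandardCharsets.UTF_8);

    /** Returns nonce (8 bytes) followed by the CTR ciphertext of the UTF-16BE text. */
    static byte[] encrypt(String text) throws Exception {
        byte[] nonce = new byte[NONCE_LEN];
        new SecureRandom().nextBytes(nonce);
        byte[] iv = new byte[16];                     // nonce + 8 zero counter bytes
        System.arraycopy(nonce, 0, iv, 0, NONCE_LEN);
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(iv));
        byte[] cipherText = cipher.doFinal(text.getBytes(StandardCharsets.UTF_16BE));
        byte[] stored = new byte[NONCE_LEN + cipherText.length];
        System.arraycopy(nonce, 0, stored, 0, NONCE_LEN);
        System.arraycopy(cipherText, 0, stored, NONCE_LEN, cipherText.length);
        return stored;
    }

    /** Rebuilds the 16-byte IV from the stored nonce and decrypts the rest. */
    static String decrypt(byte[] stored) throws Exception {
        byte[] iv = new byte[16];
        System.arraycopy(stored, 0, iv, 0, NONCE_LEN);
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(iv));
        byte[] plain = cipher.doFinal(stored, NONCE_LEN, stored.length - NONCE_LEN);
        return new String(plain, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) throws Exception {
        String chinese = "桃草夹芥人蕉芥玉芥花荷子衣兰芥花";
        byte[] stored = encrypt(chinese);
        System.out.println("Stored bytes: " + stored.length);
        System.out.println("Base64 chars: " + Base64.getEncoder().encodeToString(stored).length());
        System.out.println("Round trip ok: " + chinese.equals(decrypt(stored)));
    }
}
```

For the 16-character string from the question this stores 8 + 32 = 40 bytes, or 56 characters if the value has to be Base64-encoded for a text column.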
Notes:
Upvotes: 2