Reputation: 143
I have written an C implementation of AES and have tried to make it as fast as possible (Im just starting out in Programming and have training in IT). I have achieved an Speed increase of around 600% so far but its still awfully slow. To Compare my AES-Implementation with something i have used the "openssl speed" command in the Linux-Terminal. In 3 seconds this implementation encrypts around 36 977 043 blocks (16byte). I am ~25 times slower (at 72 seconds for the 36... bytes) than that which kinda sucks. Im curious about 2 things.
To my code: I have tried to leave out on some of my functions so see how much faster the code gets without them. The full code took 72 seconds.
My encryption function:
uint32_t * encrypt(uint32_t * expkey,uint32_t state[4]){
uint32_t temp[4];
state[0] = state[0] ^ expkey[0];
state[1] = state[1] ^ expkey[1];
state[2] = state[2] ^ expkey[2];
state[3] = state[3] ^ expkey[3];
for(int round = 1; round < Nr; round++){
// Subbytes
for (int c = 0; c < 4;c++){
temp[c] = ((sbox[state[c] >> 24 & 0xFF]) << 24 ) + ((sbox[state[c] >> 16 & 0xFF]) << 16 ) + ((sbox[state[c] >> 8 & 0xFF]) << 8 ) + (sbox[state[c] & 0xFF]);
}
// Shiftrows
state[0] = (((temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF);
state[1] = (((temp[1] >> 24) & 0xFF) << 24) + (((temp[2] >> 16) & 0xFF) << 16) + (((temp[3] >> 8) & 0xFF) << 8) + (temp[0] & 0xFF);
state[2] = (((temp[2] >> 24) & 0xFF) << 24) + (((temp[3] >> 16) & 0xFF) << 16) + (((temp[0] >> 8) & 0xFF) << 8) + (temp[1] & 0xFF);
state[3] = (((temp[3] >> 24) & 0xFF) << 24) + (((temp[0] >> 16) & 0xFF) << 16) + (((temp[1] >> 8) & 0xFF) << 8) + (temp[2] & 0xFF);
// Mixcolums
for (int c = 0; c < 4;c++){
state[c] =
((xtime((state[c] >> 24) & 0xFF) ^ xtime3((state[c] >> 16) & 0xFF) ^ ((state[c] >> 8) & 0xFF) ^ (state[c] & 0xFF)) << 24) +
((((state[c] >> 24) & 0xFF) ^ xtime((state[c] >> 16) & 0xFF) ^ xtime3((state[c] >> 8) & 0xFF) ^ (state[c] & 0xFF)) << 16) +
((((state[c] >> 24) & 0xFF) ^ ((state[c] >> 16) & 0xFF) ^ xtime((state[c] >> 8) & 0xFF) ^ xtime3(state[c] & 0xFF)) << 8 ) +
(xtime3((state[c] >> 24) & 0xFF) ^ ((state[c] >> 16) & 0xFF) ^ ((state[c] >> 8) & 0xFF) ^ xtime(state[c] & 0xFF));
}
// Add Key
state[0] = state[0] ^ expkey[round * 4];
state[1] = state[1] ^ expkey[round * 4 + 1];
state[2] = state[2] ^ expkey[round * 4 + 2];
state[3] = state[3] ^ expkey[round * 4 + 3];
}
// Last Subbytes
for (int c = 0; c < 4;c++){
temp[c] = ((sbox[state[c] >> 24 & 0xFF]) << 24 ) + ((sbox[state[c] >> 16 & 0xFF]) << 16 ) + ((sbox[state[c] >> 8 & 0xFF]) << 8 ) + (sbox[state[c] & 0xFF]);
}
*/
// Last Shiftrow
state[0] = (((temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF);
state[1] = (((temp[1] >> 24) & 0xFF) << 24) + (((temp[2] >> 16) & 0xFF) << 16) + (((temp[3] >> 8) & 0xFF) << 8) + (temp[0] & 0xFF);
state[2] = (((temp[2] >> 24) & 0xFF) << 24) + (((temp[3] >> 16) & 0xFF) << 16) + (((temp[0] >> 8) & 0xFF) << 8) + (temp[1] & 0xFF);
state[3] = (((temp[3] >> 24) & 0xFF) << 24) + (((temp[0] >> 16) & 0xFF) << 16) + (((temp[1] >> 8) & 0xFF) << 8) + (temp[2] & 0xFF);
// Last Add Key
state[0] = state[0] ^ expkey[Nr * 4];
state[1] = state[1] ^ expkey[Nr * 4 + 1];
state[2] = state[2] ^ expkey[Nr * 4 + 2];
state[3] = state[3] ^ expkey[Nr * 4 + 3];
return state;
}
And the xtime function:
uint8_t xtime(uint8_t x){
return (x << 1) ^ (0x11b & -(x >> 7));
}
I am looking forward to all tips tricks and improvements.
Upvotes: 3
Views: 2524
Reputation: 107
You have to optimize the code.
<<
and back >>
.
You can simply mask.
Replace code like this:= (temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF);
with code like this:
= (temp[0] & 0xFF000000) + (temp[1] & 0x00FF0000) + (temp[2] & 0x0000FF00) << 8) + (temp[3] & 0xFF);
^ expkey[Nr * 4 ];
^ expkey[Nr * 4 + 1];
^ expkey[Nr * 4 + 2];
^ expkey[Nr * 4 + 3];
with this:
ptr = expkey + (Nr << 2); // Shift 2 is equal to * 2
^ ptr[0];
^ ptr[1];
^ ptr[2];
^ ptr[3];
This way you will eliminate a lot of computations
Upvotes: 2
Reputation: 5636
The OpenSSL is using the AES-NI where available.
openssl speed -evp aes-128-cbc
and outputs
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 531549.19k 969335.21k 1045437.10k 1066826.75k 1054665.39k 1052120.41k
Since you are not using the AES-NI you need to compare it with the software version
OPENSSL_ia32cap=”~0x200000200000000″ openssl speed -elapsed -evp aes-128-cbc
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-128-cbc 143802.75k 161369.51k 165049.17k 166054.57k 166262.10k 166461.44k
if we compare the last column, you will see that AES-NI is ~6.3 times faster than the software version of the OpenSSL. This means that you are around 4 times slower than the software version.
In many cases, the compiler optimization parameters can also affect the speed, too. Look into the manual of your compiler, if you are using GCC then they are -O[0..3]
About the code;
If you look at the AES code of OpenSSL you will see that they use pre-computed tables and this is a very common technique.
The Subytes
, Shiftrows
and MixColums
are turned into table lookup. The speed difference is these. And note that the table lookup is vulnerable to cache-timing attacks.
Upvotes: 5