qub3
qub3

Reputation: 143

Speed of AES Implementation

I have written an C implementation of AES and have tried to make it as fast as possible (Im just starting out in Programming and have training in IT). I have achieved an Speed increase of around 600% so far but its still awfully slow. To Compare my AES-Implementation with something i have used the "openssl speed" command in the Linux-Terminal. In 3 seconds this implementation encrypts around 36 977 043 blocks (16byte). I am ~25 times slower (at 72 seconds for the 36... bytes) than that which kinda sucks. Im curious about 2 things.

  1. What would be a good goal to achieve, how fast is a realistic goal to aim at.
  2. Why is my Code so slow, and how can i change that.

To my code: I have tried to leave out on some of my functions so see how much faster the code gets without them. The full code took 72 seconds.

My encryption function:

uint32_t * encrypt(uint32_t * expkey,uint32_t state[4]){

    uint32_t temp[4];

    state[0] = state[0] ^ expkey[0];
    state[1] = state[1] ^ expkey[1];
    state[2] = state[2] ^ expkey[2];
    state[3] = state[3] ^ expkey[3];
    
    for(int round = 1; round < Nr; round++){
        
        // Subbytes
        for (int c = 0; c < 4;c++){
            temp[c] = ((sbox[state[c] >> 24 & 0xFF]) << 24 ) + ((sbox[state[c] >> 16 & 0xFF]) << 16 ) + ((sbox[state[c] >> 8 & 0xFF]) << 8 ) + (sbox[state[c] & 0xFF]);
        }
        // Shiftrows
        state[0] = (((temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF);
        state[1] = (((temp[1] >> 24) & 0xFF) << 24) + (((temp[2] >> 16) & 0xFF) << 16) + (((temp[3] >> 8) & 0xFF) << 8) + (temp[0] & 0xFF);
        state[2] = (((temp[2] >> 24) & 0xFF) << 24) + (((temp[3] >> 16) & 0xFF) << 16) + (((temp[0] >> 8) & 0xFF) << 8) + (temp[1] & 0xFF);
        state[3] = (((temp[3] >> 24) & 0xFF) << 24) + (((temp[0] >> 16) & 0xFF) << 16) + (((temp[1] >> 8) & 0xFF) << 8) + (temp[2] & 0xFF);
        
        // Mixcolums
        for (int c = 0; c < 4;c++){
            state[c] = 
                ((xtime((state[c] >> 24) & 0xFF) ^ xtime3((state[c] >> 16) & 0xFF) ^ ((state[c] >> 8) & 0xFF) ^ (state[c] & 0xFF)) << 24) +
                ((((state[c] >> 24) & 0xFF) ^ xtime((state[c] >> 16) & 0xFF) ^ xtime3((state[c] >> 8) & 0xFF) ^ (state[c] & 0xFF)) << 16) + 
                ((((state[c] >> 24) & 0xFF) ^ ((state[c] >> 16) & 0xFF) ^ xtime((state[c] >> 8) & 0xFF) ^ xtime3(state[c] & 0xFF)) << 8 ) +
                (xtime3((state[c] >> 24) & 0xFF) ^ ((state[c] >> 16) & 0xFF) ^ ((state[c] >> 8) & 0xFF) ^ xtime(state[c] & 0xFF));       
        
        }
        // Add Key
        state[0] = state[0] ^ expkey[round * 4];
        state[1] = state[1] ^ expkey[round * 4 + 1];
        state[2] = state[2] ^ expkey[round * 4 + 2];
        state[3] = state[3] ^ expkey[round * 4 + 3];
        
        }
        // Last Subbytes
        for (int c = 0; c < 4;c++){
            temp[c] = ((sbox[state[c] >> 24 & 0xFF]) << 24 ) + ((sbox[state[c] >> 16 & 0xFF]) << 16 ) + ((sbox[state[c] >> 8 & 0xFF]) << 8 ) + (sbox[state[c] & 0xFF]);
        }
        */
        // Last Shiftrow
        state[0] = (((temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF);
        state[1] = (((temp[1] >> 24) & 0xFF) << 24) + (((temp[2] >> 16) & 0xFF) << 16) + (((temp[3] >> 8) & 0xFF) << 8) + (temp[0] & 0xFF);
        state[2] = (((temp[2] >> 24) & 0xFF) << 24) + (((temp[3] >> 16) & 0xFF) << 16) + (((temp[0] >> 8) & 0xFF) << 8) + (temp[1] & 0xFF);
        state[3] = (((temp[3] >> 24) & 0xFF) << 24) + (((temp[0] >> 16) & 0xFF) << 16) + (((temp[1] >> 8) & 0xFF) << 8) + (temp[2] & 0xFF);
        
        // Last Add Key
        state[0] = state[0] ^ expkey[Nr * 4];
        state[1] = state[1] ^ expkey[Nr * 4 + 1];
        state[2] = state[2] ^ expkey[Nr * 4 + 2];
        state[3] = state[3] ^ expkey[Nr * 4 + 3];
        
        return state;
}

And the xtime function:

uint8_t xtime(uint8_t x){
    return (x << 1) ^ (0x11b & -(x >> 7));
}

I am looking forward to all tips tricks and improvements.

Upvotes: 3

Views: 2524

Answers (2)

Svetlyo
Svetlyo

Reputation: 107

You have to optimize the code.

  1. There are unnecessary shifts << and back >>. You can simply mask. Replace code like this:
= (temp[0] >> 24) & 0xFF) << 24) + (((temp[1] >> 16) & 0xFF) << 16) + (((temp[2] >> 8) & 0xFF) << 8) + (temp[3] & 0xFF);

with code like this:

= (temp[0] & 0xFF000000) + (temp[1] & 0x00FF0000) + (temp[2] & 0x0000FF00) << 8) + (temp[3] & 0xFF);
  1. Replace this:
 ^ expkey[Nr *  4    ];
 ^ expkey[Nr *  4 + 1];
 ^ expkey[Nr *  4 + 2];
 ^ expkey[Nr *  4 + 3];

with this:

ptr = expkey + (Nr << 2);   // Shift 2 is equal to * 2

^ ptr[0];
^ ptr[1];
^ ptr[2];
^ ptr[3];

This way you will eliminate a lot of computations

Upvotes: 2

kelalaka
kelalaka

Reputation: 5636

The OpenSSL is using the AES-NI where available.

openssl speed -evp aes-128-cbc

and outputs


The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc    531549.19k   969335.21k  1045437.10k  1066826.75k  1054665.39k  1052120.41k

Since you are not using the AES-NI you need to compare it with the software version

 OPENSSL_ia32cap=”~0x200000200000000″ openssl speed -elapsed -evp aes-128-cbc

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc    143802.75k   161369.51k   165049.17k   166054.57k   166262.10k   166461.44k

if we compare the last column, you will see that AES-NI is ~6.3 times faster than the software version of the OpenSSL. This means that you are around 4 times slower than the software version.

In many cases, the compiler optimization parameters can also affect the speed, too. Look into the manual of your compiler, if you are using GCC then they are -O[0..3]


About the code;

If you look at the AES code of OpenSSL you will see that they use pre-computed tables and this is a very common technique.

The Subytes, Shiftrows and MixColums are turned into table lookup. The speed difference is these. And note that the table lookup is vulnerable to cache-timing attacks.

Upvotes: 5

Related Questions