Reputation: 52646
I have been trying to understand how to write cache-friendly code. As a first step, I was trying to understand the performance difference between row-major and column-major array access.
So I created an int array of size 512×512, for a total size of 1 MB. My L1 cache is 32 KB, my L2 cache is 256 KB, and my L3 cache is 3 MB, so the array fits in the L3 cache.
I simply calculated the sum of the array elements in row-major order and in column-major order and compared their speeds. Every time, the column-major order was slightly faster. I expected row-major order to be faster, maybe several times faster.
I thought the problem might be due to the small size of the array, so I made another array of size 8192×8192 (256 MB). Still the same result.
Below is the code snippet I used:
#include "time.h"
#include <stdio.h>
#define S 512
#define M S
#define N S
int main() {
// Summing in the row major order
int x = 0;
int iter = 25000;
int i, j;
int k[M][N];
int sum = 0;
clock_t start, end;
start = clock();
while(x < iter) {
for (i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
sum += k[i][j];
}
}
x++;
}
end = clock();
printf("%i\n", end-start);
// Summing in the column major order
x = 0;
sum = 0;
int h[M][N];
start = clock();
while(x < iter) {
for (j = 0; j < N; j++) {
for(i = 0; i < M; i++){
sum += k[i][j];
}
}
x++;
}
end = clock();
printf("%i\n", end-start);
}
Question: can someone tell me what my mistake is and why I am getting this result?
Upvotes: 5
Views: 3749
Reputation: 803
Although this question is 11 years old and already marked as answered, I found the OP's problem super interesting and not explained well. I've taken a look at the code with different optimization levels and found a few things.
With GCC's -O0 (which is the default one) I'm getting the same results as the OP had:
$ gcc -O0 main.c
$ ./a.out
10964492
9092597
but -Og shows an even bigger difference:
$ gcc -Og main.c
$ ./a.out
3203853
1639167
Despite the code running faster at -Og, the delta time is about the same as at -O0. This most likely indicates there is a small overhead during the 1st summing. I bet this is because the 2nd summing explicitly reuses variables from the 1st one.
-O1 shows almost no delta time between the summings; however, at -O2 and -O3 almost the whole code gets optimized away, resulting in instant execution:
$ gcc -O2 main.c
$ ./a.out
1
1
The disassembly shows that the loops are totally gone. Only the calls to printf and clock and their setup are still there, nothing else.
Now, the funny part: why this is happening, how to prevent it, and how to make an actual measurement and get correct results.
First of all, the OP's code has almost no side effects and reads uninitialized memory, which is undefined behavior. This creates a perfect opportunity for the compiler to optimize everything away, even during early stages of compilation.
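As an aside, besides printing the result the way the tweaked code below does, another common way to keep such a loop from being eliminated is to initialize the data and route the sum through a volatile object. A minimal sketch of that trick (my own illustration, not part of the original code):

#include <stdio.h>

#define M 512
#define N 512

volatile int sink;   /* the compiler cannot prove a store to this is dead */

int main(void)
{
    static int k[M][N];               /* static => zero-initialized, so reading it is defined */

    for (int i = 0; i < M; i++)       /* fill with small values so the sum cannot overflow */
        for (int j = 0; j < N; j++)
            k[i][j] = i + j;

    int sum = 0;
    for (int i = 0; i < M; i++)       /* row-major pass */
        for (int j = 0; j < N; j++)
            sum += k[i][j];

    sink = sum;                       /* observable side effect keeps the loop alive */
    printf("sum: %d\n", sum);
}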
Another thing is that the stride between columns is really small. The OP already mentioned in the question that the whole array may fit into the L3 cache (which is probably true). One cache line is usually 64 bytes, so even when "iterating by the columns" the lines may still be in the cache, unless some other process evicts them.
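To put numbers on that: with 4-byte ints and 64-byte lines, one cache line holds 16 consecutive ints. In the OP's 512×512 case, one column-major step jumps 512 × 4 = 2048 bytes, i.e. 32 lines ahead, but since the whole 1 MB array is swept 25000 times and fits in L3, those lines are still resident on the next pass.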
The other thing is that both buffers are allocated on the stack, and since they're uninitialized they may even end up in the same memory block.
Considering the above, I've tweaked the code a bit. I introduced more side effects by printing the results, which should prevent the aggressive optimizations. The buffers are stored on the heap and the memory is initialized (though they seem to land at the same address every time, so this is being reused by either the kernel or libc). The strides are also bigger. The only problem is that the summing overflows the int (strictly speaking that is undefined behavior in C, though on x86_64 the value wraps around in practice), and the sum values are always the same.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define M 12345
#define N 54321

int main(void)
{
    { // Summing in the row major order
        int *k = malloc(sizeof(int) * M * N);
        printf("k: %zu\n", (size_t)k);
        const clock_t start = clock();
        for (int x = 0; x < M * N; ++x)
            k[x] = x;
        int sum = 0;
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                sum += k[i*N + j]; // <----- pay attention to this line
        printf("sum: %d\n", sum);
        const clock_t end = clock();
        printf("time: %f\n", (float)(end-start)/(1000000)); // CLOCKS_PER_SEC is 1000000 on POSIX
        free(k);
    }
    { // Summing in the column major order
        int *k = malloc(sizeof(int) * M * N);
        printf("k: %zu\n", (size_t)k);
        const clock_t start = clock();
        for (int x = 0; x < M * N; ++x)
            k[x] = x;
        int sum = 0;
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                sum += k[i + j*M]; // <----- pay attention to this line
        printf("sum: %d\n", sum);
        const clock_t end = clock();
        printf("time: %f\n", (float)(end-start)/(1000000)); // CLOCKS_PER_SEC is 1000000 on POSIX
        free(k);
    }
}
The results now look correct, and the differences between cache hits and misses are noticeable across different optimization levels:
$ gcc -O0 main2.c
$ ./a.out
k: 136416159858704
sum: -188591980
time: 2.688405
k: 136416159858704
sum: -188591980
time: 4.117664
$ gcc -O2 main2.c
$ ./a.out
k: 138602927357968
sum: -188591980
time: 0.626179
k: 138602927357968
sum: -188591980
time: 1.408703
Swapping the order of summings also produces expected results:
$ gcc -O0 main3.c
$ ./a.out
k: 130537320611856
sum: -188591980
time: 4.169673
k: 130537320611856
sum: -188591980
time: 2.678053
However, -O3 again seems to optimize the loops. They're not entirely gone like with the OP's code, but the access pattern has been optimized:
$ gcc -O3 main2.c
$ ./a.out
k: 131184013082640
sum: -188591980
time: 0.366076
k: 131184013082640
sum: -188591980
time: 0.363325
I guess the code is too simple and GCC is just way too smart at -O3 :)
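If you want to confirm what the compiler did to the loops, one option (my suggestion, not something from the measurements above) is GCC's optimization report together with the disassembly:
$ gcc -O3 -fopt-info-vec main2.c
$ objdump -d a.out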
I used GCC version 14.1.1 20240522 with an Intel i5-8600K 4.30GHz 6c/6t CPU on Linux kernel 6.9.6.
I hope someone will find this useful someday just like I did.
Cheers!
Upvotes: 0
Reputation: 75854
I don't really know why you get this behaviour, but let me clarify some things.
There are at least two things to consider when thinking about the cache: the cache size and the cache line size. For instance, my Intel i7 920 processor has a 256 KB L2 cache with a 64-byte line size. If your data fits inside the cache, then it really doesn't matter in which order you access it.

Optimizing code to be cache friendly comes down to two things. The first: if possible, split the memory accesses into blocks in such a way that a block fits in the cache; do all the computations possible with that block, then bring in the next block, do the computations with it, and so on (see the sketch below). The other thing (the one you are trying to do) is to access the memory in a consecutive way. When you request a piece of data from memory (let's say an int, 4 bytes), a whole cache line is brought into the cache (in my case 64 bytes: that is, 16 adjacent integers, including the one you requested, are brought into the cache). Here is where row order vs. column order comes into play. With row order you have 1 cache miss for every 16 memory requests; with column order you get a cache miss for every request, but only if your data doesn't fit in the cache. If your data does fit in the cache, you get the same ratio as with row order, because you still have the lines in the cache from way back when you requested the first element of each line (of course, associativity can come into play and a cache line can be evicted even if not all of the cache is filled with your data).
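To make the blocking idea concrete, here is a minimal sketch of loop tiling (my example, not part of the original answer), using a transpose rather than a plain sum, since a plain sum is already sequential; the tile size and the cache sizes in the comments are assumptions:

#include <stddef.h>

#define BLOCK 64   /* 64x64 ints per tile = 16 KB, assumed small enough to sit in a 32 KB L1 */

/* Transpose an n x n matrix tile by tile: while we work on one tile, both its
   source lines and its destination lines stay in cache, instead of a strided
   sweep over the whole matrix evicting lines before they are reused. */
void blocked_transpose(int *dst, const int *src, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}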
Regarding your problem: when the data fits in the cache, as I said, the access order doesn't matter that much; but when you do the second summing, the data is already in the cache from when you did the first sum, so that's why it is faster. If you do the column-order sum first, you should see that the row-order sum becomes faster simply because it is done second. However, when the data is large enough, you shouldn't get the same behaviour. Try the following: between the two sums, do something with another large chunk of data in order to invalidate the whole cache (a sketch of that idea follows below).
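A minimal sketch of that cache-invalidation step (my own illustration; the buffer size is an assumption based on the 3 MB L3 mentioned in the question):

#include <stdlib.h>

/* Touch a scratch buffer larger than the last-level cache so that the
   previously cached array is evicted before the next timed loop. */
static void thrash_cache(size_t bytes)          /* e.g. 8 * 1024 * 1024 for a 3 MB L3 */
{
    volatile char *scratch = malloc(bytes);
    if (!scratch)
        return;
    for (size_t i = 0; i < bytes; i++)          /* volatile stores cannot be optimized away */
        scratch[i] = (char)i;
    free((void *)scratch);
}

Calling something like thrash_cache(8 * 1024 * 1024); between the two timed loops should make the second sum start from a cold cache.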
Edit
I see a 3-4x speedup for row major (although I expected >8x speedup. any idea why?). [..] it would be great if you could tell me why speedup is only 3x
It's not that accessing the matrix the "right way" doesn't improve things much; it's more that accessing the matrix the "wrong way" doesn't hurt that much, if that makes any sense.
Although I can't provide you with a specific and exact answer, what I can tell you is that modern processors have very complicated and extremely efficient cache systems. They are so powerful that, for instance, in many common cases they can mask the cache levels, making it seem like instead of a 3-level cache you have one big single-level cache (you don't see a penalty when increasing your data from a size that fits in L2 to a size that only fits in L3). Run your code on an older processor (let's say 10 years old) and you will probably see the speedup you expect.

Modern desktop processors, however, have mechanisms that help a lot with cache misses. They are designed with the philosophy of running "bad code" fast, so a lot of investment is made in improving "bad code" performance, because the vast majority of desktop applications aren't written by people who understand branching issues or cache models. This is opposed to the high-performance market, where specialized processors make bad code hurt very much because they implement only weak mechanisms for dealing with "bad code" (or don't implement them at all). These mechanisms take up a lot of transistors and so increase the power consumption and the heat generated, but they are worth implementing in a desktop processor where most of the code is "bad code".
Upvotes: 12