Reputation: 52646
I have been trying to understand how to write cache-friendly code. As a first step, I was trying to understand the performance difference between row-major and column-major array access.
So I created an int array of size 512×512, for a total size of 1 MB. My L1 cache is 32 KB, my L2 cache is 256 KB, and my L3 cache is 3 MB, so the array fits in the L3 cache.
I simply calculated the sum of the array elements in row-major order and in column-major order and compared their speeds. Every time, the column-major order was slightly faster. I expected row-major order to be faster, maybe several times faster.
I thought the problem might be due to the small size of the array, so I made another array of size 8192×8192 (256 MB). Still the same result.
Below is the code snippet I used:
#include "time.h"
#include <stdio.h>
#define S 512
#define M S
#define N S
int main() {
// Summing in the row major order
int x = 0;
int iter = 25000;
int i, j;
int k[M][N];
int sum = 0;
clock_t start, end;
start = clock();
while(x < iter) {
for (i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
sum += k[i][j];
}
}
x++;
}
end = clock();
printf("%i\n", end-start);
// Summing in the column major order
x = 0;
sum = 0;
int h[M][N];
start = clock();
while(x < iter) {
for (j = 0; j < N; j++) {
for(i = 0; i < M; i++){
sum += k[i][j];
}
}
x++;
}
end = clock();
printf("%i\n", end-start);
}
Question: can someone tell me what my mistake is and why I am getting this result?
Upvotes: 5
Views: 3749
Reputation: 803
Although this question is 11 years old and already marked as answered, I found the OP's problem super interesting and not explained well. I've taken a look at the code with different optimization levels and found a few things.
With GCC's -O0 (which is the default one) I'm getting the same results as the OP had:
$ gcc -O0 main.c
$ ./a.out
10964492
9092597
but -Og shows an even bigger difference:
$ gcc -Og main.c
$ ./a.out
3203853
1639167
Despite the code running faster at -Og, the delta time is about the same as at -O0. This most likely indicates there is a small overhead during the 1st summing. I bet this is because the 2nd summing explicitly reuses variables from the 1st one.
-O1 shows almost no delta time between the summings; however, at -O2 and -O3 almost the whole code gets optimized away, resulting in instant execution:
$ gcc -O2 main.c
$ ./a.out
1
1
The disassembly shows that the loops are totally gone. Only the calls to printf and clock and their setup are still there, nothing else.
Now, the funny part: why this is happening, how to prevent it, and how to make an actual measurement and get correct results.
First of all, the OP's code has almost no side effects and reads uninitialized memory, which is undefined behavior. This creates a perfect opportunity for the compiler to optimize everything away, even during early stages of compilation.
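As an aside, besides printing the result the way the tweaked code below does, another common way to keep such a loop from being eliminated is to initialize the data and route the sum through a volatile object. A minimal sketch of that trick (my own illustration, not part of the original code):

#include <stdio.h>

#define M 512
#define N 512

volatile int sink;   /* the compiler cannot prove a store to this is dead */

int main(void)
{
    static int k[M][N];               /* static => zero-initialized, so reading it is defined */

    for (int i = 0; i < M; i++)       /* fill with small values so the sum cannot overflow */
        for (int j = 0; j < N; j++)
            k[i][j] = i + j;

    int sum = 0;
    for (int i = 0; i < M; i++)       /* row-major pass */
        for (int j = 0; j < N; j++)
            sum += k[i][j];

    sink = sum;                       /* observable side effect keeps the loop alive */
    printf("sum: %d\n", sum);
}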
Another thing is that the stride between columns is really small. The OP already mentioned in the question that the whole array may fit into the L3 cache (which is probably true). One cache line is usually 64 bytes, so even when "iterating by the columns" the lines may still be in the cache, unless some other process evicts them.
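To put numbers on that: with 4-byte ints and 64-byte lines, one cache line holds 16 consecutive ints. In the OP's 512×512 case, one column-major step jumps 512 × 4 = 2048 bytes, i.e. 32 lines ahead, but since the whole 1 MB array is swept 25000 times and fits in L3, those lines are still resident on the next pass.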
The other thing is that both buffers are allocated on the stack, and since they're uninitialized they may even end up in the same memory block.
Considering the above, I've tweaked the code a bit. I introduced more side effects by printing the results, which should prevent the aggressive optimizations. The buffers are stored on the heap and the memory is initialized (though they seem to land at the same address every time, so this is being reused by either the kernel or libc). The strides are also bigger. The only problem is that the summing overflows the int (strictly speaking that is undefined behavior in C, though on x86_64 the value wraps around in practice), and the sum values are always the same.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define M 12345
#define N 54321

int main(void)
{
    { // Summing in the row major order
        int *k = malloc(sizeof(int) * M * N);
        printf("k: %zu\n", (size_t)k);
        const clock_t start = clock();
        for (int x = 0; x < M * N; ++x)
            k[x] = x;
        int sum = 0;
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                sum += k[i*N + j]; // <----- pay attention to this line
        printf("sum: %d\n", sum);
        const clock_t end = clock();
        printf("time: %f\n", (float)(end-start)/(1000000)); // CLOCKS_PER_SEC is 1000000 on POSIX
        free(k);
    }
    { // Summing in the column major order
        int *k = malloc(sizeof(int) * M * N);
        printf("k: %zu\n", (size_t)k);
        const clock_t start = clock();
        for (int x = 0; x < M * N; ++x)
            k[x] = x;
        int sum = 0;
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                sum += k[i + j*M]; // <----- pay attention to this line
        printf("sum: %d\n", sum);
        const clock_t end = clock();
        printf("time: %f\n", (float)(end-start)/(1000000)); // CLOCKS_PER_SEC is 1000000 on POSIX
        free(k);
    }
}
The results now look correct, and the differences between cache hits and misses are noticeable across different optimization levels:
$ gcc -O0 main2.c
$ ./a.out
k: 136416159858704
sum: -188591980
time: 2.688405
k: 136416159858704
sum: -188591980
time: 4.117664
$ gcc -O2 main2.c
$ ./a.out
k: 138602927357968
sum: -188591980
time: 0.626179
k: 138602927357968
sum: -188591980
time: 1.408703
Swapping the order of summings also produces expected results:
$ gcc -O0 main3.c
$ ./a.out
k: 130537320611856
sum: -188591980
time: 4.169673
k: 130537320611856
sum: -188591980
time: 2.678053
However, -O3 again seems to optimize the loops. They're not entirely gone like with the OP's code, but the access pattern has been optimized:
$ gcc -O3 main2.c
$ ./a.out
k: 131184013082640
sum: -188591980
time: 0.366076
k: 131184013082640
sum: -188591980
time: 0.363325
I guess the code is too simple and GCC is just way too smart at -O3 :)
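If you want to confirm what the compiler did to the loops, one option (my suggestion, not something from the measurements above) is GCC's optimization report together with the disassembly:
$ gcc -O3 -fopt-info-vec main2.c
$ objdump -d a.out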
I used GCC version 14.1.1 20240522 with an Intel i5-8600K 4.30GHz 6c/6t CPU on Linux kernel 6.9.6.
I hope someone will find this useful someday just like I did.
Cheers!
Upvotes: 0
Reputation: 75854
I don't really know why you get this behaviour, but let me clarify some things.
There are at least two things to consider when thinking about the cache: the cache size and the cache line size. For instance, my Intel i7 920 processor has a 256 KB L2 cache with a 64-byte line size. If your data fits inside the cache, then it really doesn't matter in which order you access it.

Optimizing code to be cache friendly comes down to two things. The first: if possible, split the memory accesses into blocks in such a way that a block fits in the cache; do all the computations possible with that block, then bring in the next block, do the computations with it, and so on (see the sketch below). The other thing (the one you are trying to do) is to access the memory in a consecutive way. When you request a piece of data from memory (let's say an int, 4 bytes), a whole cache line is brought into the cache (in my case 64 bytes: that is, 16 adjacent integers, including the one you requested, are brought into the cache). Here is where row order vs. column order comes into play. With row order you have 1 cache miss for every 16 memory requests; with column order you get a cache miss for every request, but only if your data doesn't fit in the cache. If your data does fit in the cache, you get the same ratio as with row order, because you still have the lines in the cache from way back when you requested the first element of each line (of course, associativity can come into play and a cache line can be evicted even if not all of the cache is filled with your data).
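To make the blocking idea concrete, here is a minimal sketch of loop tiling (my example, not part of the original answer), using a transpose rather than a plain sum, since a plain sum is already sequential; the tile size and the cache sizes in the comments are assumptions:

#include <stddef.h>

#define BLOCK 64   /* 64x64 ints per tile = 16 KB, assumed small enough to sit in a 32 KB L1 */

/* Transpose an n x n matrix tile by tile: while we work on one tile, both its
   source lines and its destination lines stay in cache, instead of a strided
   sweep over the whole matrix evicting lines before they are reused. */
void blocked_transpose(int *dst, const int *src, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}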
Regarding your problem: when the data fits in the cache, as I said, the access order doesn't matter that much; but when you do the second summing, the data is already in the cache from when you did the first sum, so that's why it is faster. If you do the column-order sum first, you should see that the row-order sum becomes faster simply because it is done second. However, when the data is large enough, you shouldn't get the same behaviour. Try the following: between the two sums, do something with another large chunk of data in order to invalidate the whole cache (a sketch of that idea follows below).
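A minimal sketch of that cache-invalidation step (my own illustration; the buffer size is an assumption based on the 3 MB L3 mentioned in the question):

#include <stdlib.h>

/* Touch a scratch buffer larger than the last-level cache so that the
   previously cached array is evicted before the next timed loop. */
static void thrash_cache(size_t bytes)          /* e.g. 8 * 1024 * 1024 for a 3 MB L3 */
{
    volatile char *scratch = malloc(bytes);
    if (!scratch)
        return;
    for (size_t i = 0; i < bytes; i++)          /* volatile stores cannot be optimized away */
        scratch[i] = (char)i;
    free((void *)scratch);
}

Calling something like thrash_cache(8 * 1024 * 1024); between the two timed loops should make the second sum start from a cold cache.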
Edit
I see a 3-4x speedup for row major (although I expected >8x speedup. any idea why?). [..] it would be great if you could tell me why speedup is only 3x
It's not that accessing the matrix the "right way" doesn't improve things much; it's more that accessing the matrix the "wrong way" doesn't hurt that much, if that makes any sense.
Although I can't provide you with a specific and exact answer, what I can tell you is that modern processors have very complicated and extremely efficient cache systems. They are so powerful that, for instance, in many common cases they can mask the cache levels, making it seem like instead of a 3-level cache you have one big single-level cache (you don't see a penalty when increasing your data from a size that fits in L2 to a size that only fits in L3). Run your code on an older processor (let's say 10 years old) and you will probably see the speedup you expect.

Modern desktop processors, however, have mechanisms that help a lot with cache misses. They are designed with the philosophy of running "bad code" fast, so a lot of investment is made in improving "bad code" performance, because the vast majority of desktop applications aren't written by people who understand branching issues or cache models. This is opposed to the high-performance market, where specialized processors make bad code hurt very much because they implement only weak mechanisms for dealing with "bad code" (or don't implement them at all). These mechanisms take up a lot of transistors and so increase the power consumption and the heat generated, but they are worth implementing in a desktop processor where most of the code is "bad code".
Upvotes: 12