Reputation: 5073
My application does some operations on large matrices. I recently came across the concept of caches and the performance effect they can have through this answer. I would like to know which of the following algorithms is more cache friendly for my case.
Algorithm 1:
for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * c[j][j]; // K is a constant double; c is a 2-D array of doubles
    }
}
Algorithm 2:
double *A = new double[size];
for (int n = 0; n < size; n++)
    A[n] = c[n][n]; // copy the diagonal into a contiguous 1-D array

for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * A[j];
    }
}
delete[] A; // free the temporary diagonal copy
The size of my array is more than 1000x1000. Benchmarking on my laptop shows that Algorithm 2 performs better than Algorithm 1 for size 5000x5000. Please note that I have multi-threaded my application so that each thread operates on a contiguous set of rows (a sketch of this split follows the example below).
For example, for an array of size 1000x1000:
thread1 -> row 0 to row 249
thread2 -> row 250 to row 499
thread3 -> row 500 to row 749
thread4 -> row 750 to row 999
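For context, here is a minimal sketch of that row split using std::thread. All the concrete values (size, K, the thread count) and the function name are placeholders, not my actual code:

#include <thread>
#include <vector>

const int size = 1000;
const double K = 0.5;   // placeholder constant
double c[size][size];   // the matrix being updated
double A[size];         // diagonal copy used by Algorithm 2

void update_rows(int first, int last) // handles rows [first, last)
{
    for (int i = first; i < last; i++)
        for (int j = i + 1; j < size; j++)
            c[i][j] -= K * A[j];
}

int main()
{
    for (int n = 0; n < size; n++)
        A[n] = c[n][n];

    const int num_threads = 4;
    const int chunk = size / num_threads; // 250 rows per thread, as above

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++)
        workers.emplace_back(update_rows, t * chunk, (t + 1) * chunk);
    for (auto &w : workers)
        w.join();
}

This needs no locks because each thread writes only its own rows, and the diagonal copy A is read-only during the update.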
Upvotes: 1
Views: 978
Reputation: 19706
Algorithm 2 benefits from what's called "spatial locality": moving the diagonal into a single-dimension array makes it reside at consecutive memory addresses, and thereby it:
Enjoys the benefit of fetching multiple useful elements per cache line (presumably 64 bytes, i.e. 8 doubles, depending on your CPU), better utilizing the cache and memory bandwidth (whereas fetching c[n][n] would also pull in a lot of useless data, since its neighbors on the same lines are never reused).
Enjoys the benefit of the HW stream prefetchers (assuming your CPU has them), which aggressively run ahead of your code along the page and bring the data into the lower cache levels in advance, improving the memory latency.
It should be pointed out that moving the data into A doesn't necessarily improve cacheability, since A still competes against the large amount of data constantly streaming in from c and thrashing the cache. However, since A is used over and over, there's a high chance that a good LRU policy would keep it in the cache anyway. You could help that along by using streaming (non-temporal) memory operations for array c. Note that these are very volatile performance tools and may in some scenarios cause a perf reduction if not used correctly.
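To illustrate, here's a sketch of such a non-temporal update of one row, assuming an x86 target with SSE2. The function name and layout are mine, not from the question, and since the loop also reads c[i][j] the cache line is fetched by the load anyway, so how much the streaming store actually helps is workload dependent:

#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Sketch only: c[i][j] -= K * A[j] for j in [begin, size), using
// non-temporal stores so the written row data doesn't evict A.
void update_row_streaming(double *row, const double *A, double K,
                          int begin, int size)
{
    int j = begin;

    // Scalar prologue until &row[j] is 16-byte aligned (required by _mm_stream_pd).
    while (j < size && (reinterpret_cast<std::uintptr_t>(&row[j]) & 0xF) != 0)
    {
        row[j] -= K * A[j];
        ++j;
    }

    const __m128d k = _mm_set1_pd(K);
    for (; j + 1 < size; j += 2)
    {
        __m128d v = _mm_load_pd(&row[j]);  // aligned load of c[i][j], c[i][j+1]
        __m128d a = _mm_loadu_pd(&A[j]);   // A may be misaligned at this offset
        v = _mm_sub_pd(v, _mm_mul_pd(k, a));
        _mm_stream_pd(&row[j], v);         // write around the cache
    }

    for (; j < size; ++j)                  // scalar tail
        row[j] -= K * A[j];

    _mm_sfence(); // order the streaming stores before other threads read the data
}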
Another potential benefit could come from mixing in SW prefetches slightly ahead of reaching each new matrix row, for example as sketched below.
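A minimal sketch of that idea, reusing c, A, K and size from Algorithm 2. __builtin_prefetch is a GCC/Clang builtin (use _mm_prefetch elsewhere), and the distance of one full row ahead is a tuning assumption, not a prescription:

for (int i = 0; i < size; i++)
{
    // Ask for the first cache line of the next row before we get there.
    if (i + 1 < size)
        __builtin_prefetch(&c[i + 1][i + 2], 1 /* prefetch for write */);

    for (int j = i + 1; j < size; j++)
        c[i][j] -= K * A[j];
}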
Upvotes: 2
Reputation: 129314
If your benchmarks show significant improvement for the second case, then it most likely is the better choice. But of course, to know what happens for "an average CPU", we'd have to measure on a large number of CPUs that can be called average; there is no other way. And it really depends on the definition of an average CPU: are we talking "any x86 (AMD + Intel) CPU", or "any random CPU we can find in anything from a watch to the latest super-fast creation in the x86 range"?
The "copy the data in c[n][n]
" method helps because it gets its own address, and doesn't get thrown out of the (L1) cache when the code walks its way over the larger matrix [and all the data you need for the multiplication is "close together". If you walk c[j][j]
, every j
steps will jump sizeof(double) * (size * j + 1)
bytes per iteration, so if size is anything more than 4, the next item needed wont be in the same cache-line, so another memory read is needed to get that data.
In other words, for anything that has a decent size cache (bigger than size * sizeof(double)
), it's a definite benefit. Even with smaller cache, it's quite likely SOME benefit, but the chances are higher that the cached copy will be thrown out by some part of c[i][j]
.
In summary, the second algorithm is very likely the better choice on nearly all hardware.
Upvotes: 3