Reputation: 5073
My application does some operations on large matrices. I recently came across the concept of caches and the performance effect they can have through this answer. I would like to know which of the following algorithms is more cache friendly for my case.
Algorithm 1:
for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * c[j][j]; // K is a constant double; c is a 2-D array of doubles
    }
}
Algorithm 2:
double *A = new double[size];
for (int n = 0; n < size; n++)
    A[n] = c[n][n]; // copy the diagonal into a contiguous 1-D array

for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        c[i][j] -= K * A[j];
    }
}
delete[] A; // free the temporary diagonal copy
The size of my array is more than 1000x1000. Benchmarking on my laptop shows that Algorithm 2 performs better than Algorithm 1 for size 5000x5000. Please note that I have multi-threaded my application so that each thread operates on a contiguous set of rows (a sketch of this split follows the example below).
For example, for an array of size 1000x1000:
thread1 -> row 0 to row 249
thread2 -> row 250 to row 499
thread3 -> row 500 to row 749
thread4 -> row 750 to row 999
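For context, here is a minimal sketch of that row split using std::thread. All the concrete values (size, K, the thread count) and the function name are placeholders, not my actual code:

#include <thread>
#include <vector>

const int size = 1000;
const double K = 0.5;   // placeholder constant
double c[size][size];   // the matrix being updated
double A[size];         // diagonal copy used by Algorithm 2

void update_rows(int first, int last) // handles rows [first, last)
{
    for (int i = first; i < last; i++)
        for (int j = i + 1; j < size; j++)
            c[i][j] -= K * A[j];
}

int main()
{
    for (int n = 0; n < size; n++)
        A[n] = c[n][n];

    const int num_threads = 4;
    const int chunk = size / num_threads; // 250 rows per thread, as above

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++)
        workers.emplace_back(update_rows, t * chunk, (t + 1) * chunk);
    for (auto &w : workers)
        w.join();
}

This needs no locks because each thread writes only its own rows, and the diagonal copy A is read-only during the update.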
Upvotes: 1
Views: 978
Reputation: 19706
Algorithm 2 benefits from what's called "spatial locality": moving the diagonal into a single-dimension array makes it reside at consecutive memory addresses, and thereby it:
Enjoys the benefit of fetching multiple useful elements per cache line (presumably 64 bytes, i.e. 8 doubles, depending on your CPU), better utilizing the cache and memory bandwidth (whereas fetching c[n][n] would also pull in a lot of useless data, since its neighbors on the same lines are never reused).
Enjoys the benefit of the HW stream prefetchers (assuming your CPU has them), which aggressively run ahead of your code along the page and bring the data into the lower cache levels in advance, improving the memory latency.
It should be pointed out that moving the data into A doesn't necessarily improve cacheability, since A still competes against the large amount of data constantly streaming in from c and thrashing the cache. However, since A is used over and over, there's a high chance that a good LRU policy would keep it in the cache anyway. You could help that along by using streaming (non-temporal) memory operations for array c. Note that these are very volatile performance tools and may in some scenarios cause a perf reduction if not used correctly.
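To illustrate, here's a sketch of such a non-temporal update of one row, assuming an x86 target with SSE2. The function name and layout are mine, not from the question, and since the loop also reads c[i][j] the cache line is fetched by the load anyway, so how much the streaming store actually helps is workload dependent:

#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Sketch only: c[i][j] -= K * A[j] for j in [begin, size), using
// non-temporal stores so the written row data doesn't evict A.
void update_row_streaming(double *row, const double *A, double K,
                          int begin, int size)
{
    int j = begin;

    // Scalar prologue until &row[j] is 16-byte aligned (required by _mm_stream_pd).
    while (j < size && (reinterpret_cast<std::uintptr_t>(&row[j]) & 0xF) != 0)
    {
        row[j] -= K * A[j];
        ++j;
    }

    const __m128d k = _mm_set1_pd(K);
    for (; j + 1 < size; j += 2)
    {
        __m128d v = _mm_load_pd(&row[j]);  // aligned load of c[i][j], c[i][j+1]
        __m128d a = _mm_loadu_pd(&A[j]);   // A may be misaligned at this offset
        v = _mm_sub_pd(v, _mm_mul_pd(k, a));
        _mm_stream_pd(&row[j], v);         // write around the cache
    }

    for (; j < size; ++j)                  // scalar tail
        row[j] -= K * A[j];

    _mm_sfence(); // order the streaming stores before other threads read the data
}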
Another potential benefit could come from mixing in SW prefetches slightly ahead of reaching each new matrix row, for example as sketched below.
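A minimal sketch of that idea, reusing c, A, K and size from Algorithm 2. __builtin_prefetch is a GCC/Clang builtin (use _mm_prefetch elsewhere), and the distance of one full row ahead is a tuning assumption, not a prescription:

for (int i = 0; i < size; i++)
{
    // Ask for the first cache line of the next row before we get there.
    if (i + 1 < size)
        __builtin_prefetch(&c[i + 1][i + 2], 1 /* prefetch for write */);

    for (int j = i + 1; j < size; j++)
        c[i][j] -= K * A[j];
}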
Upvotes: 2
Reputation: 129314
If your benchmarks show significant improvement for the second case, then it most likely is the better choice. But of course, to know what happens for "an average CPU", we'd have to measure on a large number of CPUs that can be called average; there is no other way. And it really depends on the definition of an average CPU: are we talking "any x86 (AMD + Intel) CPU", or "any random CPU we can find in anything from a watch to the latest super-fast creation in the x86 range"?
The "copy the data in c[n][n]
" method helps because it gets its own address, and doesn't get thrown out of the (L1) cache when the code walks its way over the larger matrix [and all the data you need for the multiplication is "close together". If you walk c[j][j]
, every j
steps will jump sizeof(double) * (size * j + 1)
bytes per iteration, so if size is anything more than 4, the next item needed wont be in the same cache-line, so another memory read is needed to get that data.
In other words, for anything that has a decent size cache (bigger than size * sizeof(double)
), it's a definite benefit. Even with smaller cache, it's quite likely SOME benefit, but the chances are higher that the cached copy will be thrown out by some part of c[i][j]
.
In summary, the second algorithm is very likely the better choice on nearly all hardware.
Upvotes: 3