Reputation: 4946
I'm reading cpumemory.pdf by Ulrich Drepper and I'm unable to understand the following part about optimizing cache access in matrix multiplication from chapter 6.2.1 (pages 49-50):
First the naive method for matrix multiplication is shown:
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
            res[i][j] += mul1[i][k] * mul2[k][j];
mul2 is accessed by columns, so each access loads a whole cache line of which only one value is used. Ulrich says:
With sizeof(double) being 8 this means that, to fully utilize the cache line, we should unroll the middle loop 8 times.
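Spelled out, that 8-way unroll of the middle loop would look something like this (my own sketch, assuming a 64-byte cache line, i.e. 8 doubles per line, and N being a multiple of 8):

for (i = 0; i < N; ++i)
    for (j = 0; j < N; j += 8)
        for (k = 0; k < N; ++k) {
            /* the 8 consecutive elements mul2[k][j..j+7] (and likewise
               res[i][j..j+7]) come from a single cache line, assuming
               64-byte-aligned rows */
            res[i][j+0] += mul1[i][k] * mul2[k][j+0];
            res[i][j+1] += mul1[i][k] * mul2[k][j+1];
            res[i][j+2] += mul1[i][k] * mul2[k][j+2];
            res[i][j+3] += mul1[i][k] * mul2[k][j+3];
            res[i][j+4] += mul1[i][k] * mul2[k][j+4];
            res[i][j+5] += mul1[i][k] * mul2[k][j+5];
            res[i][j+6] += mul1[i][k] * mul2[k][j+6];
            res[i][j+7] += mul1[i][k] * mul2[k][j+7];
        }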
For brevity I unrolled the middle loop only 2 times:
for (i = 0; i < N; ++i)
    for (j = 0; j < N; j += 2)
        for (k = 0; k < N; ++k) {
            res[i][j+0] += mul1[i][k] * mul2[k][j+0];
            res[i][j+1] += mul1[i][k] * mul2[k][j+1];
        }
Now it's obvious that if a cache line were 2 double values wide, it would be fully utilized. But then Ulrich continues:
Continuing this thought, to effectively use the res matrix as well, i.e., to write 8 results at the same time, we should unroll the outer loop 8 times as well.
For brevity I again unrolled the outer loop only 2 times:
for (i = 0; i < N; i += 2)
    for (j = 0; j < N; j += 2)
        for (k = 0; k < N; ++k) {
            res[i+0][j+0] += mul1[i+0][k] * mul2[k][j+0];
            res[i+0][j+1] += mul1[i+0][k] * mul2[k][j+1];
            res[i+1][j+0] += mul1[i+1][k] * mul2[k][j+0];
            res[i+1][j+1] += mul1[i+1][k] * mul2[k][j+1];
        }
To me it seems even worse than the previous version, because now mul1 is accessed by columns. Please explain what Ulrich meant.
Upvotes: 0
Views: 416
Reputation: 1013
There are three matrices competing for the cache: the left input, the right input and the result.
The left input is accessed just fine by the original code: it is row-major and the innermost loop increments k, so the code marches along a cache line. The right input is handled well after the single unrolling, because now all the columns that share a cache line are used before that line is evicted.
The question is the result matrix. It is also row-major, but its cache lines are indexed by j, not by k, and you are right: j has already been unrolled, so all the elements on a result cache line are already being used. There doesn't appear to be anything gained by the second unrolling; all it does is add two extra cache lines per iteration, one more for the left matrix and one more for the result matrix. It doesn't improve the coverage of elements of any cache line.
However, it does happen to reuse the right matrix's cache line twice. That reduces the total number of times the lines of the right matrix have to be brought in, and it does not increase the number of times the left and result matrix cache lines have to be brought in, so perhaps that reuse of an entire line is where the advantage comes from. I guess the real question is whether the loops are properly blocked to the cache size, and what the set associativity of the cache is: if all lines of all three matrices stay in the cache anyway, then this unrolling has no advantage (but it doesn't make anything worse either!).
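To make "blocked to the cache size" concrete, here is a minimal tiled sketch (not Drepper's exact listing, just the blocking idea; SM is an assumed block size of cache-line-size / sizeof(double), and N is assumed to be a multiple of SM):

#define SM 8   /* assumed: 64-byte cache line / sizeof(double) */

for (i = 0; i < N; i += SM)
    for (j = 0; j < N; j += SM)
        for (k = 0; k < N; k += SM)
            /* multiply one SM x SM tile of mul1 with one SM x SM tile
               of mul2 into an SM x SM tile of res; the lines touched
               here stay hot and are reused before the tile is finished */
            for (i2 = i; i2 < i + SM; ++i2)
                for (k2 = k; k2 < k + SM; ++k2)
                    for (j2 = j; j2 < j + SM; ++j2)
                        res[i2][j2] += mul1[i2][k2] * mul2[k2][j2];

With the i2/k2/j2 order inside the tile, the innermost loop still walks along cache lines of res and mul2 while mul1[i2][k2] is loop-invariant, and as long as three SM x SM tiles fit in the cache, every line is fully reused before it is evicted.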
Upvotes: 1