Reputation: 4946
I'm reading cpumemory.pdf by Ulrich Drepper and I'm unable to understand the following part about optimizing cache access in matrix multiplication from chapter 6.2.1 (pages 49-50):
First the naive method for matrix multiplication is shown:
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
            res[i][j] += mul1[i][k] * mul2[k][j];
mul2 is accessed by columns, so each access loads a whole cache line of which only one value is used. Ulrich says:
With sizeof(double) being 8 this means that, to fully utilize the cache line, we should unroll the middle loop 8 times.
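Spelled out, that 8-way unroll of the middle loop would look something like this (my own sketch, assuming a 64-byte cache line, i.e. 8 doubles per line, and N being a multiple of 8):

for (i = 0; i < N; ++i)
    for (j = 0; j < N; j += 8)
        for (k = 0; k < N; ++k) {
            /* the 8 consecutive elements mul2[k][j..j+7] (and likewise
               res[i][j..j+7]) come from a single cache line, assuming
               64-byte-aligned rows */
            res[i][j+0] += mul1[i][k] * mul2[k][j+0];
            res[i][j+1] += mul1[i][k] * mul2[k][j+1];
            res[i][j+2] += mul1[i][k] * mul2[k][j+2];
            res[i][j+3] += mul1[i][k] * mul2[k][j+3];
            res[i][j+4] += mul1[i][k] * mul2[k][j+4];
            res[i][j+5] += mul1[i][k] * mul2[k][j+5];
            res[i][j+6] += mul1[i][k] * mul2[k][j+6];
            res[i][j+7] += mul1[i][k] * mul2[k][j+7];
        }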
For brevity I unrolled the middle loop only 2 times:
for (i = 0; i < N; ++i)
    for (j = 0; j < N; j += 2)
        for (k = 0; k < N; ++k) {
            res[i][j+0] += mul1[i][k] * mul2[k][j+0];
            res[i][j+1] += mul1[i][k] * mul2[k][j+1];
        }
Now it's obvious that if a cache line were 2 double values wide, it would be fully utilized. But then Ulrich continues:
Continuing this thought, to effectively use the res matrix as well, i.e., to write 8 results at the same time, we should unroll the outer loop 8 times as well.
For brevity I again unrolled the outer loop only 2 times:
for (i = 0; i < N; i += 2)
    for (j = 0; j < N; j += 2)
        for (k = 0; k < N; ++k) {
            res[i+0][j+0] += mul1[i+0][k] * mul2[k][j+0];
            res[i+0][j+1] += mul1[i+0][k] * mul2[k][j+1];
            res[i+1][j+0] += mul1[i+1][k] * mul2[k][j+0];
            res[i+1][j+1] += mul1[i+1][k] * mul2[k][j+1];
        }
To me it seems even worse than the previous version, because now mul1 is accessed by columns. Please explain what Ulrich meant.
Upvotes: 0
Views: 416
Reputation: 1013
There are three matrices competing for the cache: the left input, the right input and the result.
The left input is accessed just fine by the original code: it is row-major and the innermost loop increments k, so the code marches along a cache line. The right input is handled well after the single unrolling, because now all the columns that share a cache line are used before that line is evicted.
The question is the result matrix. It is also row-major, but its cache lines are indexed by j, not by k, and you are right: j has already been unrolled, so all the elements on a result cache line are already being used. There doesn't appear to be anything gained by the second unrolling; all it does is add two extra cache lines per iteration, one more for the left matrix and one more for the result matrix. It doesn't improve the coverage of elements of any cache line.
However, it does happen to reuse the right matrix's cache line twice. That reduces the total number of times the lines of the right matrix have to be brought in, and it does not increase the number of times the left and result matrix cache lines have to be brought in, so perhaps that reuse of an entire line is where the advantage comes from. I guess the real question is whether the loops are properly blocked to the cache size, and what the set associativity of the cache is: if all lines of all three matrices stay in the cache anyway, then this unrolling has no advantage (but it doesn't make anything worse either!).
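To make "blocked to the cache size" concrete, here is a minimal tiled sketch (not Drepper's exact listing, just the blocking idea; SM is an assumed block size of cache-line-size / sizeof(double), and N is assumed to be a multiple of SM):

#define SM 8   /* assumed: 64-byte cache line / sizeof(double) */

for (i = 0; i < N; i += SM)
    for (j = 0; j < N; j += SM)
        for (k = 0; k < N; k += SM)
            /* multiply one SM x SM tile of mul1 with one SM x SM tile
               of mul2 into an SM x SM tile of res; the lines touched
               here stay hot and are reused before the tile is finished */
            for (i2 = i; i2 < i + SM; ++i2)
                for (k2 = k; k2 < k + SM; ++k2)
                    for (j2 = j; j2 < j + SM; ++j2)
                        res[i2][j2] += mul1[i2][k2] * mul2[k2][j2];

With the i2/k2/j2 order inside the tile, the innermost loop still walks along cache lines of res and mul2 while mul1[i2][k2] is loop-invariant, and as long as three SM x SM tiles fit in the cache, every line is fully reused before it is evicted.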
Upvotes: 1