Cache unfriendly loop over 2d-array faster than cache friendly loop

Question

Why is version 1 faster than version 2 when compiling with MSVC++?

Version 1:

for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            res1[i][j] += mat1[i][k] * mat2[k][j];

Version 2:

for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k)
            res1[i][j] += mat1[i][k] * mat2[j][k];

(N=1000; res1,mat1,mat2 are double[N][N] arrays)

Shouldn't Version 2 be faster because within the loops indexing over mat2 with [j][k] is cache friendly (when loading mat2[j][k] from ram into the cache mat2[j][k+1], mat2[j][k+2],... will also be loaded because they are on same cachline))?

(If i turn off compiler optimization (using: "#pragma optimize( "", off )") version 2 is faster than version 1 but the code runs a lot slower (obviously)).

EDIT:

Performance: (time measured using windows.h ==> QueryPerformanceCounter)

With compiler optimization: Version 1: ~493 ms; Version2: 954 ms Without compiler optimization: Version1: ~3868 ms; Version 2: ~2266 ms

2501 · Accepted Answer

Using optimizations, for the first version, the compiler can obviously reorder the inner two loops into:

for (int i = 0; i < N; ++i)
    for (int k = 0; k < N; ++k)
        for (int j = 0; j < N; ++j)
            res1[i][j] += mat1[i][k] * mat2[k][j];

This will make the first version similar to the second in terms of cache awareness.

The reason why is the first version twice as fast, could probably be the caching of it's second term: mat1[i][k] since it, after doing above optimization, doesn't change in the inner loop.

Cache unfriendly loop over 2d-array faster than cache friendly loop

Answers (1)

Related Questions