M. Zdunek

Reputation: 73

Not sure how to explain some of the performance results of my parallelized matrix multiplication code

I'm running this OpenMP code for matrix multiplication and I measured its performance:

#pragma omp for schedule(static)
for (int j = 0; j < COLUMNS; j++)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
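For completeness, each variant wraps this loop nest in a parallel region; for example, the variant with the directive before the first loop and dynamic scheduling with chunk size 1 looks roughly like this (just a sketch, with the matrices as globals and the element type taken to be double):

// Rough shape of one variant: directive before the first (j) loop,
// dynamic scheduling with chunk size 1
void multiply_matrices_1_dynamic_1(void)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic, 1)
        for (int j = 0; j < COLUMNS; j++)
            for (int k = 0; k < COLUMNS; k++)
                for (int i = 0; i < ROWS; i++)
                    matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}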

There are different versions of the code based on where I put the #pragma omp directive - before the j loop, the k loop, or the i loop. For each of those variants I also ran versions with default static scheduling, static scheduling with chunk sizes 1 and 10, and dynamic scheduling with the same chunk sizes. I also measured the number of DC (data cache) accesses, DC misses, CPU clocks, retired instructions, and other performance indicators in CodeXL. Here are the results for a matrix of size 1000x1000 on an AMD Phenom II X4 945:

[Image: Results of performance measurements]

Here, multiply_matrices_1_dynamic_1 is the function with #pragma omp before the first loop and dynamic scheduling with chunk size 1, and so on. Here are some things I don't quite understand about the results and would appreciate help with:

Also, I'm confused about the relation of TLB misses to cache misses. When exactly is the DTLB used? My professor's document says that every DC access is a DTLB request, but I don't understand how that works - the number of TLB misses is often bigger than the number of DC accesses. How do I compute the TLB miss ratio? My professor says it's TLB misses / DC accesses. He also says I can estimate temporal locality from the cache hit ratio and spatial locality from the TLB hit ratio. How does that work exactly?
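If I follow those formulas, the computation would be something like this (the counter values below are made up purely to illustrate the arithmetic; they are not taken from my measurements):

#include <stdio.h>

int main(void)
{
    /* made-up counter values, only to show how the ratios are formed */
    double dc_accesses = 4.0e9;   /* data cache accesses */
    double dc_misses   = 2.0e8;   /* data cache misses   */
    double dtlb_misses = 5.0e7;   /* DTLB misses         */

    printf("DC hit ratio    = %.3f\n", 1.0 - dc_misses / dc_accesses);   /* temporal locality proxy */
    printf("DTLB miss ratio = %.4f\n", dtlb_misses / dc_accesses);
    printf("DTLB hit ratio  = %.4f\n", 1.0 - dtlb_misses / dc_accesses); /* spatial locality proxy */
    return 0;
}

Is that the right way to read it?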

Upvotes: 1

Views: 330

Answers (2)

Z boson

Reputation: 33669

Gilles has the right idea that your code is cache unfriendly, but his solution still has a similar problem: its reduction over k reads matrix_b[k][j], which strides down a column rather than along a row.

One solution is to compute the transpose of matrix_b; then the reduction can read matrix_bT[j][k] over k, which is cache friendly. The transpose costs O(n^2) and the matrix multiplication O(n^3), so the relative cost of the transpose goes as 1/n, i.e. for large n it becomes negligible.
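A sketch of that approach (untested, and assuming the same global arrays as in your question plus a double element type, which your code doesn't show):

// Assumed: ROWS, COLUMNS, matrix_a, matrix_b and matrix_r as in the question,
// with double elements (an assumption on my part)
static double matrix_bT[COLUMNS][COLUMNS];

void multiply_matrices_transposed(void)
{
    #pragma omp parallel
    {
        // O(n^2) transpose so the inner product can read matrix_bT[j][k] contiguously
        #pragma omp for schedule(static)
        for (int k = 0; k < COLUMNS; k++)
            for (int j = 0; j < COLUMNS; j++)
                matrix_bT[j][k] = matrix_b[k][j];

        // implicit barrier here, then the O(n^3) multiplication:
        // both matrix_a[i][k] and matrix_bT[j][k] are now read linearly over k
        #pragma omp for schedule(static)
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLUMNS; j++) {
                double sum = 0.0;
                for (int k = 0; k < COLUMNS; k++)
                    sum += matrix_a[i][k] * matrix_bT[j][k];
                matrix_r[i][j] += sum;
            }
    }
}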

But there is an even easier solution than using a transpose: reorder the loops so that j is innermost and do the reduction over j, like this:

#pragma omp for schedule(static)
for (int i = 0; i < ROWS; i++) {
    for (int k = 0; k < COLUMNS; k++) {
        for (int j = 0; j < COLUMNS; j++) {
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
        }
    }
}

Gilles' method requires two reads from memory per iteration of the inner loop, whereas this solution requires two reads and a write. But it is much more cache friendly, and that more than makes up for the extra write to memory.

Upvotes: 2

Gilles

Reputation: 9489

I'm not sure what your figures show, but what I am sure of is that your code, the way it is written at the moment, is about as ineffective as it can be. So discussing the details of this or that counter will make very little sense until you have made the code reasonably effective.

The reason I claim your code is ineffective is that the order in which you organised your loops is probably the worst possible one: none of the accesses to your data are linear, which leads to a very ineffective use of the cache. By simply swapping the loops around, you should dramatically improve your performance, and can then start looking at what else can be done to improve it further.

This version for example should already be much better (not tested):

#pragma omp for schedule( static )
for ( int i = 0; i < ROWS; i++ ) {
    for ( int j = 0; j < COLUMNS; j++ ) {
        auto res = matrix_r[i][j]; // IDK the type here
        #pragma omp simd reduction( + : res )
        for ( int k = 0; k < COLUMNS; k++ ) {
           res += matrix_a[i][k] * matrix_b[k][j];
        }
        matrix_r[i][j] = res;
    }
}

(NB: I added the simd directive just because it looked appropriate, but it was by no means the point here.)

From there, experimenting with loop collapsing, thread scheduling and/or loop tiling will start to make sense.
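For illustration, a tiled variant of the same kernel could look something like this (untested; BLOCK is a tile size to tune so that the working set of a tile fits in cache, and 64 is only an example value):

#define BLOCK 64   // tile size to tune for your cache; 64 is only an example

#pragma omp for schedule(static)
for (int ii = 0; ii < ROWS; ii += BLOCK)
    for (int kk = 0; kk < COLUMNS; kk += BLOCK)
        for (int jj = 0; jj < COLUMNS; jj += BLOCK)
            // work on one tile at a time so the data stays in cache
            for (int i = ii; i < ii + BLOCK && i < ROWS; i++)
                for (int k = kk; k < kk + BLOCK && k < COLUMNS; k++)
                    for (int j = jj; j < jj + BLOCK && j < COLUMNS; j++)
                        matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];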

Upvotes: 1
