How does reduction operations in OpenMP work under the hood?

Question

I am trying to optimize the performance in a parallel-for loop where I have a reduction variable (called delta) and I am wondering how this is handled under the hood by the OpenMP library.

Lets take as an example the following piece of code, where I simply declare the variable as a reduction one, at the beginning of the loop as follows:

#pragma omp parallel shared(delta, A, B, rows, colms) private(i, j)
.
.
.
    #pragma omp for reduction(+:delta)
            for (i=1; i<=rows; i++){
                for (j=1; j<=colms; j++){
                delta += fabs(A[i][j]- B[i][j]);
                }
    }
.
.
.
//end of parallel region

I am wondering whether during the calculation each thread sets a lock when accessing the delta variable and furthermore whether I could increase the performance by replacing delta variable with an array delta[number_of_threads], where each thread will write in a different position of the array during the calculation and then sum-up all the elements after the parallel region.

Michael Klemm · Accepted Answer

Each thread will have its own copy of 'delta' on its stack frame:

#pragma omp parallel shared(delta, A, B, rows, colms) private(i, j)
{
    double local_delta; // one copy per thread

    __omp_init_schedule(1, rows, &lb, &ub);
    for (i=lb; i<=ub; i++) {
        for (j=1; j<=colms; j++) {
                local_delta += fabs(A[i][j]- B[i][j]);
        }
    }
   __omp_reduce(&delta, local_delta);  // accumulate thread's delta with shared var
   __omp_barrier();                    // do the barrier of the for construct
}

Please take the above as pseudo-code. The actual code pattern will depend on the implementation, inlining, and all sorts of other optimization an OpenMP implementation might do. If you want to read a bit about how things work please have a look at [1] and [2].

The implementation of __omp_reduce() can either be something tree-based or sequential using either locks or atomic instructions. OpenMP implementations are usually rather smart and choose the right algorithm for the machine and/or the number of threads being used.

Doing the delta[numthreads] modification will likely reduce performance by more than 100x, as this is the typical example for false-sharing as delta[0] for thread 0 and delta[1] for thread one will be in the same cache line and this cause a lot of traffic on the cache and the memory. A better appraoch would be to introduce patting delta[numthreads * 8] (assuming that delta is 8 bytes), so that each thread gets its own cache line. However, then you still need to perform the final aggregation and likely the OpenMP implementation still does a better job.

[1] https://www.dontknow.de/openmp-stuff/the-thing-from-another-world-or-how-do-openmp-compilers-work-part-1/

[2] https://www.dontknow.de/openmp-stuff/thunk-you-very-much-or-how-do-openmp-compilers-work-part-2/

How does reduction operations in OpenMP work under the hood?

Answers (1)

Related Questions