How to do an OpenMP reduction (sum) inside a parallel region? (The result is needed on the master thread only.)
Algorithm prototype:
#pragma omp parallel
{
    t = omp_get_thread_num();

    while iterate
    {
        float f = get_local_result(t);

        // fsum is required on master only
        float fsum = // ? - SUM of f

        if (t == 0)
            MPI_Bcast(&fsum, ...);
    }
}
If I put an OpenMP parallel region inside the while iterate loop, the parallel region overhead at each iteration kills the performance...
Here is the simplest way to do this:
float sharedFsum = 0.f;

#pragma omp parallel
{
    const int t = omp_get_thread_num();

    while(iteration_condition)
    {
        float f = get_local_result(t);

        // Manual reduction
        #pragma omp atomic update
        sharedFsum += f;

        // Ensure the reduction is completed
        #pragma omp barrier

        #pragma omp master
        {
            MPI_Bcast(&sharedFsum, ...);
            // sharedFsum must be reinitialized for the next iteration
            sharedFsum = 0.f;
        }

        // Ensure no other thread updates sharedFsum during the MPI_Bcast
        #pragma omp barrier
    }
}
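Note that calling MPI_Bcast from inside an OpenMP parallel region requires MPI to be initialized with a sufficient thread-support level. Since only the master thread makes MPI calls here, MPI_THREAD_FUNNELED is enough. A minimal initialization sketch (aborting on failure is just one possible policy):

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
if (provided < MPI_THREAD_FUNNELED)
    MPI_Abort(MPI_COMM_WORLD, 1); // thread support is insufficient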
Atomic operations can be costly if you have a lot of threads (e.g. hundreds). A better approach is to let the OpenMP runtime perform the reduction for you. Here is a better version:
float sharedFsum = 0.f;

#pragma omp parallel
{
    const int threadCount = omp_get_num_threads();

    while(iteration_condition)
    {
        // Execute get_local_result on each thread and
        // perform the reduction into sharedFsum
        #pragma omp for reduction(+:sharedFsum) schedule(static,1)
        for(int i = 0 ; i < threadCount ; ++i)
            sharedFsum += get_local_result(i);

        #pragma omp master
        {
            MPI_Bcast(&sharedFsum, ...);
            // sharedFsum must be reinitialized for the next iteration
            sharedFsum = 0.f;
        }

        // Ensure no other thread updates sharedFsum during the MPI_Bcast
        #pragma omp barrier
    }
}
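For completeness, here is a minimal compilable sketch of this second version. The dummy get_local_result, the fixed iteration count, and the MPI_Bcast arguments (count 1, root 0, MPI_COMM_WORLD) are assumptions filled in for illustration only, since the original code elides them:

// sketch.c -- compile with, e.g.: mpicc -fopenmp sketch.c -o sketch
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

// Dummy stand-in for the real per-thread computation (assumption).
static float get_local_result(int t)
{
    return (float)(t + 1);
}

int main(int argc, char** argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    float sharedFsum = 0.f;

    #pragma omp parallel
    {
        const int threadCount = omp_get_num_threads();

        // Fixed iteration count standing in for "while iterate" (assumption).
        for(int iter = 0 ; iter < 5 ; ++iter)
        {
            // Per-thread work plus reduction; the implicit barrier at the
            // end of the "omp for" guarantees sharedFsum is complete.
            #pragma omp for reduction(+:sharedFsum) schedule(static,1)
            for(int i = 0 ; i < threadCount ; ++i)
                sharedFsum += get_local_result(i);

            #pragma omp master
            {
                // Count 1, root 0 and MPI_COMM_WORLD are assumptions.
                MPI_Bcast(&sharedFsum, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
                printf("iteration %d: fsum = %f\n", iter, sharedFsum);
                sharedFsum = 0.f; // reset before the next reduction
            }

            // No thread may start the next iteration before the reset.
            #pragma omp barrier
        }
    }

    MPI_Finalize();
    return 0;
}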
Side notes:

- t is not protected in your code; use private(t) on the #pragma omp parallel directive to avoid undefined behavior due to a race condition. Alternatively, you can use a scoped variable declared inside the region, as shown in the sketch after this list.
- #pragma omp master should be preferred to a conditional on the thread ID.
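Here is what the two fixes for t look like (a sketch, independent of the rest of the code):

int t;
#pragma omp parallel private(t)   // each thread gets its own copy of t
{
    t = omp_get_thread_num();
    // ...
}

// Preferred: a scoped variable is private to each thread by construction
#pragma omp parallel
{
    const int t = omp_get_thread_num();
    // ...
}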
"parallel region overhead at each iteration kills the performance..."

Most of the time this is due to either (implicit) synchronizations/communications or a work imbalance. The code above may have the same problem since it is quite synchronous. If it makes sense in your application, you can make it a bit less synchronous (and thus possibly faster) by removing or moving barriers, depending on the relative speed of MPI_Bcast and get_local_result. However, this is far from easy to do correctly. One way to do it is to use OpenMP tasks and multi-buffering.