How to do an OpenMP reduction (sum) inside a parallel region? (The result is needed on the master thread only.)
Algorithm prototype:
#pragma omp parallel
{
    t = omp_get_thread_num();

    while iterate
    {
        float f = get_local_result(t);

        // fsum is required on master only
        float fsum = // ? - SUM of f

        if (t == 0)
            MPI_Bcast(&fsum, ...);
    }
}
If I put an OpenMP parallel region inside the while iterate loop, the parallel region overhead at each iteration kills the performance...
Here is the simplest way to do this:
float sharedFsum = 0.f;

#pragma omp parallel
{
    const int t = omp_get_thread_num();

    while(iteration_condition)
    {
        float f = get_local_result(t);

        // Manual reduction
        #pragma omp atomic update
        sharedFsum += f;

        // Ensure the reduction is completed
        #pragma omp barrier

        #pragma omp master
        {
            MPI_Bcast(&sharedFsum, ...);
            // sharedFsum must be reinitialized for the next iteration
            sharedFsum = 0.f;
        }

        // Ensure no other thread updates sharedFsum during the MPI_Bcast
        #pragma omp barrier
    }
}
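Note that calling MPI_Bcast from inside an OpenMP parallel region requires MPI to be initialized with a sufficient thread-support level. Since only the master thread makes MPI calls here, MPI_THREAD_FUNNELED is enough. A minimal initialization sketch (aborting on failure is just one possible policy):

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
if (provided < MPI_THREAD_FUNNELED)
    MPI_Abort(MPI_COMM_WORLD, 1); // thread support is insufficient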
Atomic operations can be costly if you have a lot of threads (e.g. hundreds). A better approach is to let the OpenMP runtime perform the reduction for you. Here is a better version:
float sharedFsum = 0.f;

#pragma omp parallel
{
    const int threadCount = omp_get_num_threads();

    while(iteration_condition)
    {
        // Execute get_local_result on each thread and
        // perform the reduction into sharedFsum
        #pragma omp for reduction(+:sharedFsum) schedule(static,1)
        for(int i = 0 ; i < threadCount ; ++i)
            sharedFsum += get_local_result(i);

        #pragma omp master
        {
            MPI_Bcast(&sharedFsum, ...);
            // sharedFsum must be reinitialized for the next iteration
            sharedFsum = 0.f;
        }

        // Ensure no other thread updates sharedFsum during the MPI_Bcast
        #pragma omp barrier
    }
}
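For completeness, here is a minimal compilable sketch of this second version. The dummy get_local_result, the fixed iteration count, and the MPI_Bcast arguments (count 1, root 0, MPI_COMM_WORLD) are assumptions filled in for illustration only, since the original code elides them:

// sketch.c -- compile with, e.g.: mpicc -fopenmp sketch.c -o sketch
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

// Dummy stand-in for the real per-thread computation (assumption).
static float get_local_result(int t)
{
    return (float)(t + 1);
}

int main(int argc, char** argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    float sharedFsum = 0.f;

    #pragma omp parallel
    {
        const int threadCount = omp_get_num_threads();

        // Fixed iteration count standing in for "while iterate" (assumption).
        for(int iter = 0 ; iter < 5 ; ++iter)
        {
            // Per-thread work plus reduction; the implicit barrier at the
            // end of the "omp for" guarantees sharedFsum is complete.
            #pragma omp for reduction(+:sharedFsum) schedule(static,1)
            for(int i = 0 ; i < threadCount ; ++i)
                sharedFsum += get_local_result(i);

            #pragma omp master
            {
                // Count 1, root 0 and MPI_COMM_WORLD are assumptions.
                MPI_Bcast(&sharedFsum, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
                printf("iteration %d: fsum = %f\n", iter, sharedFsum);
                sharedFsum = 0.f; // reset before the next reduction
            }

            // No thread may start the next iteration before the reset.
            #pragma omp barrier
        }
    }

    MPI_Finalize();
    return 0;
}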
Side notes:

- t is not protected in your code; use private(t) on the #pragma omp parallel directive to avoid undefined behavior due to a race condition. Alternatively, you can use a scoped variable declared inside the region, as shown in the sketch after this list.
- #pragma omp master should be preferred to a conditional on the thread ID.
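Here is what the two fixes for t look like (a sketch, independent of the rest of the code):

int t;
#pragma omp parallel private(t)   // each thread gets its own copy of t
{
    t = omp_get_thread_num();
    // ...
}

// Preferred: a scoped variable is private to each thread by construction
#pragma omp parallel
{
    const int t = omp_get_thread_num();
    // ...
}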
"parallel region overhead at each iteration kills the performance..."

Most of the time this is due to either (implicit) synchronizations/communications or a work imbalance. The code above may have the same problem since it is quite synchronous. If it makes sense in your application, you can make it a bit less synchronous (and thus possibly faster) by removing or moving barriers, depending on the relative speed of MPI_Bcast and get_local_result. However, this is far from easy to do correctly. One way to do it is to use OpenMP tasks and multi-buffering.