Reputation: 1589
As the title states, I want to parallelize a sum using OpenMP. I searched for different approaches but I either do not understand what they do or they didn't work. Here's what I found:
1)
!$OMP PARALLEL WORKSHARE
P_pump_t = 0.5d0 * dcv / pi**2 * sum( k * p_pump_k * dk )
!$OMP END PARALLEL WORKSHARE
This works, but I don't understand what happens or what benefit I get.
2)
!$OMP PARALLEL DO REDUCTION(+:P_pump_t)
do l = 1, n
P_pump_t = P_pump_t + 0.5d0 * dcv / pi**2 * k(l) * p_pump_k(l) * dk(l)
end do
!$OMP END PARALLEL DO
Gives wrong results (different from 1) and 3)).
3) Of course I could compute a new array in parallel and then sum it up at the end...
A hint on how to do it best?
Upvotes: 1
Views: 1821
Reputation: 74405
Based on the amount of code that you share, I would guess that 2) "gives wrong results" means that the loop version produces incorrect results. This could happen if you omitted the initialisation of P_pump_t to 0.0 before the summation loop. Also note that both codes might produce slightly different results because of the non-associativity of floating-point operations - for example, (a+b)+c might produce a slightly different result from a+(b+c) because of the rounding and normalisation applied after each operation. Something like this would better match the vectorised version of your code:
P_pump_t = 0.0
!$OMP PARALLEL DO REDUCTION(+:P_pump_t)
do l = 1, n
P_pump_t = P_pump_t + k(l) * p_pump_k(l) * dk(l)
end do
!$OMP END PARALLEL DO
P_pump_t = 0.5d0 * dcv / pi**2 * P_pump_t
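To make this concrete, here is a minimal self-contained sketch of the corrected version (the array contents and the value of dcv are made up purely for illustration; compile with something like gfortran -fopenmp or ifort -qopenmp):

```fortran
program reduction_sum
   implicit none
   integer, parameter :: n = 1000000
   double precision, parameter :: pi = 3.141592653589793d0
   double precision :: k(n), p_pump_k(n), dk(n)
   double precision :: dcv, P_pump_t
   integer :: l

   ! Fill the arrays with dummy data for the demonstration
   dcv = 2.0d0
   do l = 1, n
      k(l) = dble(l)
      p_pump_k(l) = 1.0d0 / dble(l)
      dk(l) = 1.0d-6
   end do

   ! Initialise the reduction variable BEFORE the parallel loop.
   ! Each thread then gets a private copy initialised to 0 and the
   ! partial sums are combined when the loop ends.
   P_pump_t = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:P_pump_t)
   do l = 1, n
      P_pump_t = P_pump_t + k(l) * p_pump_k(l) * dk(l)
   end do
!$OMP END PARALLEL DO

   ! Apply the constant prefactor once, outside the loop
   P_pump_t = 0.5d0 * dcv / pi**2 * P_pump_t
   print *, 'P_pump_t = ', P_pump_t
end program reduction_sum
```

Without the P_pump_t = 0.0d0 line, the reduction would add the thread-local partial sums to whatever garbage value the uninitialised variable happened to hold, which is the most likely cause of the "wrong results" in 2).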
It is quite possible that ifort already extracts the common multiplicative factor out of the loop on its own - it is pretty good at performing such optimisations.
Also note that with Intel's OpenMP implementation the WORKSHARE directive is simply translated to SINGLE, i.e. the code in 1) actually runs serially on a single thread. On 32-bit machines that use x87 FPU instructions, one can also expect the serial version to give results different from the multithreaded one because of the higher internal precision of the x87 FPU.
Upvotes: 2