Reputation: 1589
As the title states, I want to parallelize a sum using OpenMP. I searched for different approaches but I either do not understand what they do or they didn't work. Here's what I found:
1)
!$OMP PARALLEL WORKSHARE
P_pump_t = 0.5d0 * dcv / pi**2 * sum( k * p_pump_k * dk )
!$OMP END PARALLEL WORKSHARE
This works, but I don't understand what happens or what benefit I get.
2)
!$OMP PARALLEL DO REDUCTION(+:P_pump_t)
do l = 1, n
P_pump_t = P_pump_t + 0.5d0 * dcv / pi**2 * k(l) * p_pump_k(l) * dk(l)
end do
!$OMP END PARALLEL DO
Gives wrong results (different from 1) and 3)).
3) Of course I could compute a new array in parallel and then sum it up at the end...
A hint on how to do it best?
Upvotes: 1
Views: 1821
Reputation: 74405
Based on the amount of code that you share, I would guess that 2) "gives wrong results" means that the loop version produces incorrect results. This could happen if you omitted the initialisation of P_pump_t to 0.0 before the summation loop. Also note that both codes might produce slightly different results because of the non-associativity of floating-point operations - for example, (a+b)+c might produce a slightly different result from a+(b+c) because of the rounding and normalisation applied after each operation. Something like this would better match the vectorised version of your code:
P_pump_t = 0.0
!$OMP PARALLEL DO REDUCTION(+:P_pump_t)
do l = 1, n
P_pump_t = P_pump_t + k(l) * p_pump_k(l) * dk(l)
end do
!$OMP END PARALLEL DO
P_pump_t = 0.5d0 * dcv / pi**2 * P_pump_t
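To make this concrete, here is a minimal self-contained sketch of the corrected version (the array contents and the value of dcv are made up purely for illustration; compile with something like gfortran -fopenmp or ifort -qopenmp):

```fortran
program reduction_sum
   implicit none
   integer, parameter :: n = 1000000
   double precision, parameter :: pi = 3.141592653589793d0
   double precision :: k(n), p_pump_k(n), dk(n)
   double precision :: dcv, P_pump_t
   integer :: l

   ! Fill the arrays with dummy data for the demonstration
   dcv = 2.0d0
   do l = 1, n
      k(l) = dble(l)
      p_pump_k(l) = 1.0d0 / dble(l)
      dk(l) = 1.0d-6
   end do

   ! Initialise the reduction variable BEFORE the parallel loop.
   ! Each thread then gets a private copy initialised to 0 and the
   ! partial sums are combined when the loop ends.
   P_pump_t = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:P_pump_t)
   do l = 1, n
      P_pump_t = P_pump_t + k(l) * p_pump_k(l) * dk(l)
   end do
!$OMP END PARALLEL DO

   ! Apply the constant prefactor once, outside the loop
   P_pump_t = 0.5d0 * dcv / pi**2 * P_pump_t
   print *, 'P_pump_t = ', P_pump_t
end program reduction_sum
```

Without the P_pump_t = 0.0d0 line, the reduction would add the thread-local partial sums to whatever garbage value the uninitialised variable happened to hold, which is the most likely cause of the "wrong results" in 2).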
It is quite possible that ifort already extracts the common multiplicative factor out of the loop on its own - it is pretty good at performing such optimisations.
Also note that with Intel's OpenMP implementation the WORKSHARE directive is simply translated to SINGLE, i.e. the code in 1) actually runs serially on a single thread. On 32-bit machines that use x87 FPU instructions, one can also expect the serial version to give results different from the multithreaded one because of the higher internal precision of the x87 FPU.
Upvotes: 2