Reputation: 2270
I inherited a piece of Fortran code and am tasked with parallelizing it for the 8-core machine we have. I have two versions of the code, and I am trying to use OpenMP compiler directives to speed it up. It works on one piece of code but not the other, and I cannot figure out why: they're almost identical! I ran each piece of code with and without the OpenMP directives, and the first one showed speed improvements, but not the second one. I hope I am explaining this clearly...
Code sample 1: (significant improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
  DO IN1=1,NN(1)
    SCATT(IN1,IN2) = DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2))
    UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
  ENDDO
ENDDO
!$OMP END PARALLEL DO
Code sample 2: (no improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
  DO IN1=1,NN(1)
    SCATREL = DATA(2*((IN2-1)*NN(1)+IN1)-1)/(NN(1)*NN(2))
    SCATIMG = DATA(2*((IN2-1)*NN(1)+IN1))/(NN(1)*NN(2))
    SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
    UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
  ENDDO
ENDDO
!$OMP END PARALLEL DO
I thought it might be an issue with memory overhead and such, and I have tried various combinations of putting variables in shared() and private() clauses, but they either cause segmentation faults or make it even slower.
I also thought it might be that I'm not doing enough work in the loop to see an improvement, but since there's improvement in the smaller loop that doesn't make sense to me.
Can anyone shed some light onto what I can do to see a real speed boost in the second one?
Timing data for code sample 1:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 2m 21.321s
With openMP tags: 2m 20.640s
Average runtime (profile for just this snippet)
Without openMP tags: 6.3s
With openMP tags: 4.75s
Timing data for code sample 2:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 4m 46.659s
With openMP tags: 4m 49.200s
Average runtime (profile for just this snippet)
Without openMP tags: 15.14s
With openMP tags: 46.63s
Upvotes: 0
Views: 465
Reputation: 730
The observation that the code runs slower in parallel than in serial tells me that the culprit is very likely false sharing.
The SCATT array is shared, and each thread accesses a slice of it for both reading and writing. There is no race condition in your code; however, the threads writing to the same array (albeit different slices) make things slower. The reason is that each thread loads a portion of the array SCATT into cache, and whenever another thread writes to that portion of SCATT, the data previously stored in cache is invalidated. Although the data has not actually been changed, since there is no race condition (the other thread updated a different slice of SCATT), the processor gets a signal that the cache line is invalid and thus reloads the data (see the link above for details). This causes high data transfer overhead.
The solution is to make each slice private to a given thread. In your case it is even simpler, as you do not require reading access to SCATT at all. Just replace
SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
with
SCATT0 = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT0+1.0
SCATT(IN1,IN2) = SCATT0
where SCATT0 is a private variable.
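For concreteness, the whole loop could then look like the sketch below. This is not code from the question; it assumes SCATT0 is declared as a DOUBLE COMPLEX scalar in the enclosing routine, and it also lists the scalar temporaries SCATREL and SCATIMG as private so that each thread works on its own copies:
!$OMP PARALLEL DO PRIVATE(SCATREL, SCATIMG, SCATT0)
DO IN2=1,NN(2)
  DO IN1=1,NN(1)
    ! per-iteration scalars are private to each thread
    SCATREL = DATA(2*((IN2-1)*NN(1)+IN1)-1)/(NN(1)*NN(2))
    SCATIMG = DATA(2*((IN2-1)*NN(1)+IN1))/(NN(1)*NN(2))
    SCATT0 = DCOMPLX(SCATREL, SCATIMG)
    UADI(IN1,IN2) = SCATT0+1.0
    ! SCATT is written once per element and never read back
    SCATT(IN1,IN2) = SCATT0
  ENDDO
ENDDO
!$OMP END PARALLEL DO
With every per-iteration scalar private, only the final writes to SCATT and UADI touch shared data.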
And why does this not happen in the first snippet? It certainly does; however, I suspect that the compiler might have optimized the problem away. When it calculated DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2)), it very likely stored the result in a register and used this value instead of SCATT(IN1,IN2) in UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0.
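In other words, the suspicion is that the compiler effectively transformed the body of snippet 1 into something like this (an illustrative sketch only; TEMP stands for the register-held value):
TEMP = DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2))   ! value kept in a register
SCATT(IN1,IN2) = TEMP                          ! written once, never read back
UADI(IN1,IN2) = TEMP+1.0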
Besides, if you want to speed the code up you should make the loops more efficient. The first rule of parallelization is: don't do it! Optimize the serial code first. So replace snippet 1 with the following (you could even throw in workshare around the last line):
DATA = DATA/(NN(1)*NN(2))
!$OMP PARALLEL DO private(temp)
DO IN2=1,NN(2)
  temp = (IN2-1)*NN(1)
  SCATT(:,IN2) = DATA(temp+1:temp+NN(1))
ENDDO
!$OMP END PARALLEL DO
UADI = SCATT+1.0
You can do something similar with snippet 2 as well.
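For example, a sketch of snippet 2 reworked in the same spirit (illustrative only; it reuses temp as a private integer offset, scales DATA once up front, and builds each complex element directly from the interleaved real/imaginary pairs):
DATA = DATA/(NN(1)*NN(2))
!$OMP PARALLEL DO private(temp)
DO IN2=1,NN(2)
  temp = (IN2-1)*NN(1)
  DO IN1=1,NN(1)
    ! real part at odd index, imaginary part at even index
    SCATT(IN1,IN2) = DCOMPLX(DATA(2*(temp+IN1)-1), DATA(2*(temp+IN1)))
  ENDDO
ENDDO
!$OMP END PARALLEL DO
UADI = SCATT+1.0
As with snippet 1, the division by NN(1)*NN(2) is now done once over the whole DATA array instead of twice per element, and SCATT is only written, never read back, inside the parallel loop.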
Upvotes: 2