Reputation: 2270
I inherited a piece of Fortran code and am tasked with parallelizing it for the 8-core machine we have. I have two versions of the code, and I am trying to use OpenMP compiler directives to speed it up. It works on one piece of code but not the other, and I cannot figure out why: they're almost identical! I ran each piece of code with and without the OpenMP directives, and the first one showed speed improvements, but not the second one. I hope I am explaining this clearly...
Code sample 1: (significant improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
  DO IN1=1,NN(1)
    SCATT(IN1,IN2) = DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2))
    UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
  ENDDO
ENDDO
!$OMP END PARALLEL DO
Code sample 2: (no improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
  DO IN1=1,NN(1)
    SCATREL = DATA(2*((IN2-1)*NN(1)+IN1)-1)/(NN(1)*NN(2))
    SCATIMG = DATA(2*((IN2-1)*NN(1)+IN1))/(NN(1)*NN(2))
    SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
    UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
  ENDDO
ENDDO
!$OMP END PARALLEL DO
I thought it might be an issue with memory overhead and such, and I have tried various combinations of putting variables in shared() and private() clauses, but they either cause segmentation faults or make it even slower.
I also thought it might be that I'm not doing enough work in the loop to see an improvement, but since there's improvement in the smaller loop that doesn't make sense to me.
Can anyone shed some light onto what I can do to see a real speed boost in the second one?
Timing data for code sample 1:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 2m 21.321s
With openMP tags: 2m 20.640s
Average runtime (profile for just this snippet)
Without openMP tags: 6.3s
With openMP tags: 4.75s
Timing data for code sample 2:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 4m 46.659s
With openMP tags: 4m 49.200s
Average runtime (profile for just this snippet)
Without openMP tags: 15.14s
With openMP tags: 46.63s
Upvotes: 0
Views: 465
Reputation: 730
The observation that the code runs slower in parallel than in serial tells me that the culprit is very likely false sharing.
The SCATT array is shared, and each thread accesses a slice of it for both reading and writing. There is no race condition in your code; however, the threads writing to the same array (albeit different slices) make things slower. The reason is that each thread loads a portion of the array SCATT into cache, and whenever another thread writes to that portion of SCATT, the data previously stored in cache is invalidated. Although the data has not actually been changed, since there is no race condition (the other thread updated a different slice of SCATT), the processor gets a signal that the cache line is invalid and thus reloads the data (see the link above for details). This causes high data transfer overhead.
The solution is to make each slice private to a given thread. In your case it is even simpler, as you do not require reading access to SCATT at all. Just replace
SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
with
SCATT0 = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT0+1.0
SCATT(IN1,IN2) = SCATT0
where SCATT0 is a private variable.
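For concreteness, the whole loop could then look like the sketch below. This is not code from the question; it assumes SCATT0 is declared as a DOUBLE COMPLEX scalar in the enclosing routine, and it also lists the scalar temporaries SCATREL and SCATIMG as private so that each thread works on its own copies:
!$OMP PARALLEL DO PRIVATE(SCATREL, SCATIMG, SCATT0)
DO IN2=1,NN(2)
  DO IN1=1,NN(1)
    ! per-iteration scalars are private to each thread
    SCATREL = DATA(2*((IN2-1)*NN(1)+IN1)-1)/(NN(1)*NN(2))
    SCATIMG = DATA(2*((IN2-1)*NN(1)+IN1))/(NN(1)*NN(2))
    SCATT0 = DCOMPLX(SCATREL, SCATIMG)
    UADI(IN1,IN2) = SCATT0+1.0
    ! SCATT is written once per element and never read back
    SCATT(IN1,IN2) = SCATT0
  ENDDO
ENDDO
!$OMP END PARALLEL DO
With every per-iteration scalar private, only the final writes to SCATT and UADI touch shared data.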
And why does this not happen in the first snippet? It certainly does; however, I suspect that the compiler might have optimized the problem away. When it calculated DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2)), it very likely stored the result in a register and used this value instead of SCATT(IN1,IN2) in UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0.
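In other words, the suspicion is that the compiler effectively transformed the body of snippet 1 into something like this (an illustrative sketch only; TEMP stands for the register-held value):
TEMP = DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2))   ! value kept in a register
SCATT(IN1,IN2) = TEMP                          ! written once, never read back
UADI(IN1,IN2) = TEMP+1.0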
Besides, if you want to speed the code up you should make the loops more efficient. The first rule of parallelization is: don't do it! Optimize the serial code first. So replace snippet 1 with the following (you could even throw in workshare around the last line):
DATA = DATA/(NN(1)*NN(2))
!$OMP PARALLEL DO private(temp)
DO IN2=1,NN(2)
  temp = (IN2-1)*NN(1)
  SCATT(:,IN2) = DATA(temp+1:temp+NN(1))
ENDDO
!$OMP END PARALLEL DO
UADI = SCATT+1.0
You can do something similar with snippet 2 as well.
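For example, a sketch of snippet 2 reworked in the same spirit (illustrative only; it reuses temp as a private integer offset, scales DATA once up front, and builds each complex element directly from the interleaved real/imaginary pairs):
DATA = DATA/(NN(1)*NN(2))
!$OMP PARALLEL DO private(temp)
DO IN2=1,NN(2)
  temp = (IN2-1)*NN(1)
  DO IN1=1,NN(1)
    ! real part at odd index, imaginary part at even index
    SCATT(IN1,IN2) = DCOMPLX(DATA(2*(temp+IN1)-1), DATA(2*(temp+IN1)))
  ENDDO
ENDDO
!$OMP END PARALLEL DO
UADI = SCATT+1.0
As with snippet 1, the division by NN(1)*NN(2) is now done once over the whole DATA array instead of twice per element, and SCATT is only written, never read back, inside the parallel loop.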
Upvotes: 2