Narender Koosukuntla

Reputation: 11

Low Performance of Nested DO Loop using OpenMP for FORTRAN90

I am trying to parallelize a portion of my code, which is as follows:

    !$OMP PARALLEL PRIVATE(j,x,y,xnew, ynew) SHARED(xDim, yDim, ex, f, fplus)
    !$OMP DO
    DO j = 1, 8
        DO y=1, yDim
            ynew = y+ey(j)
            DO x=1, xDim
                xnew = x+ex(j)
                IF ((xnew >= 1 .AND. xnew <= xDim) .AND.  (ynew >= 1 .AND. ynew <= yDim))  f(xnew,ynew,j)=fplus(x,y,j)
            END DO
        END DO
    END DO
    !$OMP END DO
    !$OMP END PARALLEL

I am new to OpenMP and Fortran. The single-core version gives better performance than the parallel code. Please suggest what mistake I am making here.

Upvotes: 1

Views: 657

Answers (2)

FrenchKheldar

Reputation: 485

Your performance will also depend on the sizes of the loops. You have the correct arrangement of loops, with your right-most index on the outermost loop, which gives contiguous memory access in Fortran's column-major layout. If these loops are small and all the data fits in the cache of a single processor, there will likely be no performance improvement from using OpenMP. As you saw, you can actually get a degradation in performance because of OpenMP overhead such as thread creation and destruction. And in the future, try to avoid IF statements inside nested loops; they can really hurt your performance.
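To make the overhead point concrete, here is a minimal sketch of one way to keep small problems serial: OpenMP's `IF` clause on the parallel construct. The module name, the routine `copy_planes`, and the cutoff `min_cells` are all hypothetical, and a plain per-plane copy stands in for the shifted copy from the question.

```fortran
! Sketch: only use a thread team when each plane is large enough to
! amortize the OpenMP overhead. min_cells is a hypothetical tuning knob.
MODULE guarded_copy
    IMPLICIT NONE
CONTAINS
    SUBROUTINE copy_planes(fplus, f, min_cells)
        REAL, INTENT(IN)    :: fplus(:,:,:)
        REAL, INTENT(INOUT) :: f(:,:,:)      ! assumed same shape as fplus
        INTEGER, INTENT(IN) :: min_cells
        INTEGER :: j

        ! When the IF expression is false the region runs on a single thread.
        !$OMP PARALLEL DO IF(SIZE(f,1)*SIZE(f,2) >= min_cells)
        DO j = 1, SIZE(f,3)
            f(:,:,j) = fplus(:,:,j)
        END DO
        !$OMP END PARALLEL DO
    END SUBROUTINE copy_planes
END MODULE guarded_copy
```

A sensible value for the cutoff is not something I can state in general; it would have to be found by timing both paths on the target machine.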

Upvotes: 0

Jonathan Dursi

Reputation: 50927

The problem here is that you're just copying an array slice -- there's nothing really CPU limited here that splitting things up between cores will significantly help with. Ultimately this problem is memory bound, copying data from one piece of memory to another, and increasing the number of CPUs working at once likely only increases contention.
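One rough way to see whether a copy like this is memory bound is to measure how many bytes per second a single thread already moves. The sketch below is an illustration only: the module and function names are made up, 8-byte reals are assumed, and an optimizing compiler may collapse the repeated copies, so treat the number as approximate.

```fortran
! Sketch: estimate the effective bandwidth of a plain array copy.
MODULE copy_bandwidth
    IMPLICIT NONE
    INTEGER, PARAMETER :: dp = KIND(1.0d0)
    INTEGER, PARAMETER :: i8 = SELECTED_INT_KIND(18)
CONTAINS
    ! Returns bytes moved per second (reads plus writes) for nrep copies of
    ! fplus into f, or 0 if the clock could not resolve the interval.
    FUNCTION bytes_per_second(fplus, f, nrep) RESULT(rate_bps)
        REAL(dp), INTENT(IN)    :: fplus(:,:,:)
        REAL(dp), INTENT(INOUT) :: f(:,:,:)  ! assumed same shape as fplus
        INTEGER, INTENT(IN) :: nrep
        REAL(dp) :: rate_bps, seconds, bytes
        INTEGER(i8) :: c0, c1, ticks_per_s
        INTEGER :: rep

        CALL SYSTEM_CLOCK(c0, ticks_per_s)
        DO rep = 1, nrep
            f = fplus
        END DO
        CALL SYSTEM_CLOCK(c1)

        seconds = REAL(c1 - c0, dp) / REAL(ticks_per_s, dp)
        ! One read of fplus and one write of f per repetition,
        ! at an assumed 8 bytes per value.
        bytes = 2.0_dp * 8.0_dp * REAL(SIZE(f), dp) * REAL(nrep, dp)
        rate_bps = 0.0_dp
        IF (seconds > 0.0_dp) rate_bps = bytes / seconds
    END FUNCTION bytes_per_second
END MODULE copy_bandwidth
```

If the figure for arrays much larger than the cache is already close to what the memory system can deliver, adding threads probably has little left to gain.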

Having said that, I can get small (~10%) speedups if I rework the loop a bit to get that if statement out from inside the loop. This:

    CALL tick(clock)
    !$OMP PARALLEL PRIVATE(j,x,y,xnew, ynew) SHARED(ex, ey, f, fplus) DEFAULT(none)
    !$OMP DO
    DO j = 1, 8
        ! MAX/MIN keep the indices in bounds when ex(j) or ey(j) is negative
        DO y=MAX(1,1+ey(j)), MIN(yDim,yDim+ey(j))
            DO x=MAX(1,1+ex(j)), MIN(xDim,xDim+ex(j))
                f(x,y,j)=fplus(x-ex(j),y-ey(j),j)
            END DO
        END DO
    END DO
    !$OMP END DO
    !$OMP END PARALLEL
    time2 = tock(clock)

or this:

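    CALL tick(clock)
    !$OMP PARALLEL PRIVATE(j,x,y,xnew, ynew) SHARED(ex, ey, f, fplus) DEFAULT(none)
    !$OMP DO
    DO j = 1, 8
        ! Clamped slices stay in bounds when ex(j) or ey(j) is negative
        f(MAX(1,1+ex(j)):MIN(xDim,xDim+ex(j)), MAX(1,1+ey(j)):MIN(yDim,yDim+ey(j)), j) = &
            fplus(MAX(1,1-ex(j)):MIN(xDim,xDim-ex(j)), MAX(1,1-ey(j)):MIN(yDim,yDim-ey(j)), j)
    ENDDO
    !$OMP END DO
    !$OMP END PARALLEL
    time3 = tock(clock)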

make very modest improvements. If fplus were a function of the arguments x, y, and j and were compute-intensive, things would be different; but a memory copy isn't likely to be sped up much.
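For anyone who wants to check a reworked loop against the IF-guarded original before timing it, here is a sketch under a few assumptions. The `tick`/`tock` bodies are my guess at simple `SYSTEM_CLOCK` wrappers (the snippets above use them without showing them), the routine names are hypothetical, and the destination range is clamped with `MAX`/`MIN` so that negative offsets stay in bounds.

```fortran
! Sketch: IF-guarded reference loop versus a clamped-bounds rework.
MODULE stream_check
    IMPLICIT NONE
CONTAINS
    SUBROUTINE tick(t)                  ! assumed helper: start a timer
        INTEGER, INTENT(OUT) :: t
        CALL SYSTEM_CLOCK(t)
    END SUBROUTINE tick

    REAL FUNCTION tock(t)               ! assumed helper: seconds since tick
        INTEGER, INTENT(IN) :: t
        INTEGER :: now, rate
        CALL SYSTEM_CLOCK(now, rate)
        tock = REAL(now - t) / REAL(rate)
    END FUNCTION tock

    ! Reference: test every destination index, as in the question.
    SUBROUTINE stream_if(fplus, f, ex, ey)
        REAL, INTENT(IN)    :: fplus(:,:,:)
        REAL, INTENT(INOUT) :: f(:,:,:)
        INTEGER, INTENT(IN) :: ex(:), ey(:)
        INTEGER :: j, x, y, xnew, ynew, xDim, yDim
        xDim = SIZE(f,1)
        yDim = SIZE(f,2)
        DO j = 1, SIZE(f,3)
            DO y = 1, yDim
                ynew = y + ey(j)
                DO x = 1, xDim
                    xnew = x + ex(j)
                    IF (xnew >= 1 .AND. xnew <= xDim .AND. &
                        ynew >= 1 .AND. ynew <= yDim) f(xnew,ynew,j) = fplus(x,y,j)
                END DO
            END DO
        END DO
    END SUBROUTINE stream_if

    ! Rework: clamp the destination range so no test is needed.
    SUBROUTINE stream_clamped(fplus, f, ex, ey)
        REAL, INTENT(IN)    :: fplus(:,:,:)
        REAL, INTENT(INOUT) :: f(:,:,:)
        INTEGER, INTENT(IN) :: ex(:), ey(:)
        INTEGER :: j, x, y, xDim, yDim
        xDim = SIZE(f,1)
        yDim = SIZE(f,2)
        !$OMP PARALLEL DO PRIVATE(x, y)
        DO j = 1, SIZE(f,3)
            DO y = MAX(1, 1+ey(j)), MIN(yDim, yDim+ey(j))
                DO x = MAX(1, 1+ex(j)), MIN(xDim, xDim+ex(j))
                    f(x,y,j) = fplus(x-ex(j), y-ey(j), j)
                END DO
            END DO
        END DO
        !$OMP END PARALLEL DO
    END SUBROUTINE stream_clamped
END MODULE stream_check
```

Starting from the same initial `f`, both routines should leave identical arrays, which a driver can confirm with `ALL(f_a == f_b)` while bracketing each call with `tick` and `tock` to compare timings.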

Upvotes: 3
