Reputation: 11
I am trying to parallel a portion of my code which is as follows
!$OMP PARALLEL PRIVATE(j,x,y,xnew, ynew) SHARED(xDim, yDim, ex, f, fplus)
!$OMP DO
DO j = 1, 8
DO y=1, yDim
ynew = y+ey(j)
DO x=1, xDim
xnew = x+ex(j)
IF ((xnew >= 1 .AND. xnew <= xDim) .AND. (ynew >= 1 .AND. ynew <= yDim)) f(xnew,ynew,j)=fplus(x,y,j)
END DO
END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
I am new to OpenMP and FORTRAN.. The single core gives better performance that the parallel code. Please suggest what mistake I am doing here..
Upvotes: 1
Views: 657
Reputation: 485
Your performance will also depend on the sizes of the loops. You have the correct arrangement of loops, with you right-most index on the outer loop for more optimized memory access. If these loops are smalls and all the memory can fit in the cache of a single processor, there will likely be no performance improvement from using OpenMP. As you saw, you can actually see a degradation of the performance because of the OpenMP overhead such as thread creation/destruction. And in the future, try to avoid IF statements inside nested loops, they will really hurt your performance !
Upvotes: 0
Reputation: 50927
The problem here is that you're just copying an array slice -- there's nothing really CPU limited here that splitting things up between cores will significantly help with. Ultimately this problem is memory bound, copying data from one piece of memory to another, and increasing the number of CPUs working at once likely only increases contention.
Having said that, I can get small (~10%) speedups if I rework the loop a bit to get that if statement out from inside the loop. This:
CALL tick(clock)
!$OMP PARALLEL PRIVATE(j,x,y,xnew, ynew) SHARED(ex, ey, f, fplus) DEFAULT(none)
!$OMP DO
DO j = 1, 8
DO y=1+ey(j), yDim
DO x=1+ex(j), xDim
f(x,y,j)=fplus(x-ex(j),y-ey(j),j)
END DO
END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
time2 = tock(clock)
or this:
CALL tick(clock)
!$OMP PARALLEL PRIVATE(j,x,y,xnew, ynew) SHARED(ex, ey, f, fplus) DEFAULT(none)
!$OMP DO
DO j = 1, 8
f(1+ex(j):xDim, 1+ey(j):yDim, j) = fplus(1:xDim-ex(j),1:yDim-ey(j),j)
ENDDO
!$OMP END DO
!$OMP END PARALLEL
time3 = tock(clock)
make very modest improvements. If fplus was a function of the arguments x, y, and j and were compute intensive, things would be different; but a memory copy isn't likely to be sped up much.
Upvotes: 3