Reputation: 11
I am having a heck of a time trying to figure out how to get a simple Dot Product calculation to parallel process on a Fortran code compiled by the Intel ifort compiler v 16. I have the section of code below, it is part of a program used for a more complex process, but this is where most of the time is spent by the program:
double precision function ddot(n,dx,incx,dy,incy)
c
c forms the dot product of two vectors.
c uses unrolled loops for increments equal to one.
c jack dongarra, linpack, 3/11/78.
c modified 12/3/93, array(1) declarations changed to array(*)
c
double precision dx(*),dy(*),dtemp
integer i,incx,incy,ix,iy,m,mp1,n
c
CALL OMP_SET_NUM_THREADS(12)
ddot = 0.0d0
dtemp = 0.0d0
if(n.le.0)return
if(incx.eq.1.and.incy.eq.1)go to 20
c
c code for unequal increments or equal increments
c not equal to 1
c
ix = 1
iy = 1
if(incx.lt.0)ix = (-n+1)*incx + 1
if(incy.lt.0)iy = (-n+1)*incy + 1
do 10 i = 1,n
dtemp = dtemp + dx(ix)*dy(iy)
ix = ix + incx
iy = iy + incy
10 continue
ddot = dtemp
return
c
c code for both increments equal to 1
c
c
c clean-up loop
c
20 m = mod(n,5)
if( m .eq. 0 ) go to 40
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,m) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
do 30 i = 1,m
dtemp = dtemp + dx(i)*dy(i)
30 continue
!$OMP END PARALLEL DO
if( n .lt. 5 ) go to 60
40 mp1 = m + 1
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,n,mp1) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
do 50 i = mp1,n,5
dtemp = dtemp + dx(i)*dy(i) + dx(i + 1)*dy(i + 1) +
* dx(i + 2)*dy(i + 2) + dx(i + 3)*dy(i + 3) + dx(i + 4)*dy(i + 4)
50 continue
!$OMP END PARALLEL DO
60 ddot = dtemp
return
end
I am new to the OpenMP commands and am pretty sure I have something funny in there that slows the whole thing down more than on a single core. Currently I have tried to run it on 4 threads on a slower 4(4) core machine where it actually went a bit faster than the large 20(40) core machine where we designated 12 threads for the processing. At this point I'm thinking the code is funny and doing something I don't want.
The Do loop higher up could be parallelized too, but I didn't know how to define the ix
and iy
and so just left it alone since it doesn't spend much time there.
Precision is very important, so the compiler is set to fp-mode precise. I don't know if that matters at all, but when the code does manage to generate answers they do appear correct. Basically, I'm just trying to figure out how to speed up this code, but instead parallel processing seems to slow down the process instead.
Upvotes: 1
Views: 1731
Reputation: 724
There are a bunch of Intel Webinars you can look up help you.
I have aperture optimum histogram code that does OpenMP SIMD REDUCTION. So I decided to bring in the vector into a (big,nthreads) array and do each thread in a parallel region. Generally it runs slower than just using a single core. (I have not tried vtune on that yet)
Other similar array approaches with FFTs run faster, and the cores are all at 100% with good scaling.
Basically one either needs to work out the issues or test which works better. Any OpenMP parallel takes a long time to start, so you want it way outside, and not down at the tightest level.
Generally you can be better off with PURE functions or subroutines, and using
!DEC$ ATTRIBUTES VECTOR ...
In ifort16 there is also VECTOR(REF(variable)) and the reference is new. Once all that is singing, then the parallel can be attempted.
Your DO 50 would need some big numbers for 'i' in order to make parallelised code go faster, or the OpenMP parallel 'startup' will gobble up too much time.
There is not a lot other than vtune to aid finding cashe misses (etc) to give you insight into getting to having faster code (which is really code without slowdowns). After all that, then it may be worthwhile to also compile using gfortran. I generally find that running through two compilers gives more insight into making better overall code. But if you get gains from !DEC$ Extensions, then gfortran may not help. CONTIGUOUS can also be worth trying in functions.
Upvotes: 1