Parallel processing in Dot Product

Question

I am having a heck of a time trying to figure out how to get a simple Dot Product calculation to parallel process on a Fortran code compiled by the Intel ifort compiler v 16. I have the section of code below, it is part of a program used for a more complex process, but this is where most of the time is spent by the program:

double precision function ddot(n,dx,incx,dy,incy)
c
c     forms the dot product of two vectors.
c     uses unrolled loops for increments equal to one.
c     jack dongarra, linpack, 3/11/78.
c     modified 12/3/93, array(1) declarations changed to array(*)
c
      double precision dx(*),dy(*),dtemp
      integer i,incx,incy,ix,iy,m,mp1,n
c
      CALL OMP_SET_NUM_THREADS(12)
      ddot = 0.0d0
      dtemp = 0.0d0
      if(n.le.0)return
      if(incx.eq.1.and.incy.eq.1)go to 20
c
c        code for unequal increments or equal increments
c          not equal to 1
c
      ix = 1
      iy = 1
      if(incx.lt.0)ix = (-n+1)*incx + 1
      if(incy.lt.0)iy = (-n+1)*incy + 1
      do 10 i = 1,n
        dtemp = dtemp + dx(ix)*dy(iy)
        ix = ix + incx
        iy = iy + incy
   10 continue
      ddot = dtemp
      return
c
c        code for both increments equal to 1
c
c
c        clean-up loop
c
   20 m = mod(n,5)
      if( m .eq. 0 ) go to 40
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,m) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
      do 30 i = 1,m
        dtemp = dtemp + dx(i)*dy(i)
   30 continue
!$OMP END PARALLEL DO
      if( n .lt. 5 ) go to 60
   40 mp1 = m + 1
!$OMP PARALLEL DO
!$OMP& DEFAULT(NONE) SHARED(dx,dy,n,mp1) PRIVATE(i)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION( + : dtemp )
      do 50 i = mp1,n,5
        dtemp = dtemp + dx(i)*dy(i) + dx(i + 1)*dy(i + 1) +
     *   dx(i + 2)*dy(i + 2) + dx(i + 3)*dy(i + 3) + dx(i + 4)*dy(i + 4)
   50 continue
!$OMP END PARALLEL DO
   60 ddot = dtemp
      return
      end

I am new to the OpenMP commands and am pretty sure I have something funny in there that slows the whole thing down more than on a single core. Currently I have tried to run it on 4 threads on a slower 4(4) core machine where it actually went a bit faster than the large 20(40) core machine where we designated 12 threads for the processing. At this point I'm thinking the code is funny and doing something I don't want.

The Do loop higher up could be parallelized too, but I didn't know how to define the ix and iy and so just left it alone since it doesn't spend much time there.

Precision is very important, so the compiler is set to fp-mode precise. I don't know if that matters at all, but when the code does manage to generate answers they do appear correct. Basically, I'm just trying to figure out how to speed up this code, but instead parallel processing seems to slow down the process instead.

Parallel processing in Dot Product

Answers (1)

Related Questions