Reputation: 99
I have some code I'm trying to parallelize, but I'm finding that using more threads gives me roughly a 0.5x slowdown: with 4 threads, for instance, it runs about twice as slow.
Edit: sorry, I had the wrong portion of the program in here before.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!Compute Dynamic Structure Factor of Q,T=const
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
SUBROUTINE COMPUTE_SQ(i_vmax,ION_COUNT,ions,t2,lines,lf,rx,ry,rz,r,x,y,z,s,simtype,vmax,q)
IMPLICIT NONE
Integer*4, INTENT(IN) ::i_vmax,ION_COUNT,ions,t2,lines
Real*8, INTENT(IN) ::lf,vmax
Integer*4, INTENT(IN), DIMENSION(1:51200) ::rx,ry,rz
Real*4, INTENT(IN), DIMENSION(1:51200,0:1000) ::x,y,z
Real*8, INTENT(INOUT),DIMENSION(1:i_vmax) ::s
Complex*16,INTENT(INOUT),DIMENSION(1:i_vmax,2) ::r
Complex*16,INTENT(INOUT),DIMENSION(1:i_vmax,2) ::q
Real*8, DIMENSION(1:i_vmax) ::si,co
Integer*4 ::k,i,p_start,p_end
Real*8 ::dotprod,co_temp,si_temp
Character*5,INTENT(IN) ::simtype
!!!!!!RE-INITIALIZE VARIABLES
Do 300 k=1,i_vmax
!if (mag(k).gt.vmax) then
!cycle
!endif
co(k)=0
si(k)=0
co_temp=0
si_temp=0
write(*,*) vmax
300 continue
!!!!!!!!!!!!!!!!!!!!
if (simtype.eq.'pfrac') then
p_start=30721
p_end=51200
else if (simtype.eq.'nfrac') then
p_start=0
p_end=30720
else
write (*,*) 'simtype not specified'
endif
!!!!!!!!!!!!!!!!!!!!!!
Do 31 k=1,i_vmax
! if (mag(k).gt.vmax) then
! cycle
! endif
co_temp=0
si_temp=0
!$OMP PARALLEL DO PRIVATE(dotprod,Qcur,co_temp,si_temp)
Do 41 i=p_start,p_end
dotprod=(rx(k)*x(i,t2)+ry(k)*y(i,t2)+rz(k)*z(i,t2))*lf
co_temp=co_temp+COS(dotprod) !Qcur/Qavg
si_temp=si_temp+SIN(dotprod) !Qcur/Qavg
41 continue
!$OMP END PARALLEL DO
q(k,2)= cmplx(co_temp,si_temp)
r(k,2)=r(k,2)+q(k,2)
s(k)=s(k) +(q(k,1) * conjg(q(k,2)))
s(k)=s(k)/(p_end-p_start+1)**2
!r(k,2)=r(k,2)/(p_end-p_start+1)
31 continue
RETURN
END SUBROUTINE COMPUTE_SQ
Here is the portion of the code that's relevant. At first I had the OMP directives around the entire subroutine, but I thought the threads might all be trying to read the same values and that this was slowing things down. That doesn't seem to be the case, though, since it runs at the same speed regardless of which loop the directives cover.
For reference, the inner loop is over about 20,000 iterations and the outer about 1000.
I'm using the Intel compiler 4.1.40 with the flags -mcmodel=medium -shared-intel (since it uses more than 2 GB of memory) and, of course, -openmp. I've tried it with 1, 2, 4, 8, and 16 cores, and each successive doubling of cores takes roughly 1.5x as long to run.
Any ideas appreciated!
Upvotes: 0
Views: 406
Reputation: 60088
Are you sure you are getting correct results at all? The values of the private variables are undefined on entering the parallel region and after exiting it. In your case the problematic ones are si_temp and co_temp. You should use the REDUCTION clause instead.
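Something along these lines should do it for the inner loop (an untested sketch that reuses your variable names; I also dropped Qcur from the clause, since it isn't declared anywhere in the routine):

co_temp=0
si_temp=0
!Each thread gets its own zero-initialised copies of co_temp and si_temp;
!their partial sums are combined into the shared variables when the loop finishes.
!$OMP PARALLEL DO PRIVATE(dotprod) REDUCTION(+:co_temp,si_temp)
Do 41 i=p_start,p_end
   dotprod=(rx(k)*x(i,t2)+ry(k)*y(i,t2)+rz(k)*z(i,t2))*lf
   co_temp=co_temp+COS(dotprod)
   si_temp=si_temp+SIN(dotprod)
41 continue
!$OMP END PARALLEL DO
q(k,2)= cmplx(co_temp,si_temp)

Until the reduction is in place the timings don't mean much anyway, since each thread is currently accumulating into its own undefined copy of co_temp and si_temp.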
Upvotes: 2