Reputation: 99
I have some code I'm trying to parallelize, but I'm finding that using more threads gives me roughly a 0.5x slowdown: with 4 threads, for instance, it runs about twice as slow.
Edit: sorry, I had the wrong portion of the program in here before.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!Compute Dynamic Structure Factor of Q,T=const
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
SUBROUTINE COMPUTE_SQ(i_vmax,ION_COUNT,ions,t2,lines,lf,rx,ry,rz,r,x,y,z,s,simtype,vmax,q)
IMPLICIT NONE
Integer*4, INTENT(IN) ::i_vmax,ION_COUNT,ions,t2,lines
Real*8, INTENT(IN) ::lf,vmax
Integer*4, INTENT(IN), DIMENSION(1:51200) ::rx,ry,rz
Real*4, INTENT(IN), DIMENSION(1:51200,0:1000) ::x,y,z
Real*8, INTENT(INOUT),DIMENSION(1:i_vmax) ::s
Complex*16,INTENT(INOUT),DIMENSION(1:i_vmax,2) ::r
Complex*16,INTENT(INOUT),DIMENSION(1:i_vmax,2) ::q
Real*8, DIMENSION(1:i_vmax) ::si,co
Integer*4 ::k,i,p_start,p_end
Real*8 ::dotprod,co_temp,si_temp
Character*5,INTENT(IN) ::simtype
!!!!!!RE-INITIALIZE VARIABLES
Do 300 k=1,i_vmax
!if (mag(k).gt.vmax) then
!cycle
!endif
co(k)=0
si(k)=0
co_temp=0
si_temp=0
write(*,*) vmax
300 continue
!!!!!!!!!!!!!!!!!!!!
if (simtype.eq.'pfrac') then
p_start=30721
p_end=51200
else if (simtype.eq.'nfrac') then
p_start=0
p_end=30720
else
write (*,*) 'simtype not specified'
endif
!!!!!!!!!!!!!!!!!!!!!!
Do 31 k=1,i_vmax
! if (mag(k).gt.vmax) then
! cycle
! endif
co_temp=0
si_temp=0
!$OMP PARALLEL DO PRIVATE(dotprod,Qcur,co_temp,si_temp)
Do 41 i=p_start,p_end
dotprod=(rx(k)*x(i,t2)+ry(k)*y(i,t2)+rz(k)*z(i,t2))*lf
co_temp=co_temp+COS(dotprod) !Qcur/Qavg
si_temp=si_temp+SIN(dotprod) !Qcur/Qavg
41 continue
!$OMP END PARALLEL DO
q(k,2)= cmplx(co_temp,si_temp)
r(k,2)=r(k,2)+q(k,2)
s(k)=s(k) +(q(k,1) * conjg(q(k,2)))
s(k)=s(k)/(p_end-p_start+1)**2
!r(k,2)=r(k,2)/(p_end-p_start+1)
31 continue
RETURN
END SUBROUTINE COMPUTE_SQ
Here is the portion of the code that's relevant. At first I had the OMP directives around the entire subroutine, but I thought the threads might all be trying to read the same values and that this was slowing things down. That doesn't seem to be the case, though, since it runs at the same speed regardless of which loop the directives cover.
For reference, the inner loop is over about 20,000 iterations and the outer about 1000.
I'm using the Intel compiler 4.1.40 with the flags -mcmodel=medium -shared-intel (since it uses more than 2 GB of memory) and, of course, -openmp. I've tried it with 1, 2, 4, 8, and 16 cores, and each successive doubling of cores takes roughly 1.5x as long to run.
Any ideas appreciated!
Upvotes: 0
Views: 406
Reputation: 60088
Are you sure you are getting correct results at all? The values of the private variables are undefined on entering the parallel region and after exiting it. In your case the problematic ones are si_temp and co_temp. You should use the REDUCTION clause instead.
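Something along these lines should do it for the inner loop (an untested sketch that reuses your variable names; I also dropped Qcur from the clause, since it isn't declared anywhere in the routine):

co_temp=0
si_temp=0
!Each thread gets its own zero-initialised copies of co_temp and si_temp;
!their partial sums are combined into the shared variables when the loop finishes.
!$OMP PARALLEL DO PRIVATE(dotprod) REDUCTION(+:co_temp,si_temp)
Do 41 i=p_start,p_end
   dotprod=(rx(k)*x(i,t2)+ry(k)*y(i,t2)+rz(k)*z(i,t2))*lf
   co_temp=co_temp+COS(dotprod)
   si_temp=si_temp+SIN(dotprod)
41 continue
!$OMP END PARALLEL DO
q(k,2)= cmplx(co_temp,si_temp)

Until the reduction is in place the timings don't mean much anyway, since each thread is currently accumulating into its own undefined copy of co_temp and si_temp.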
Upvotes: 2