Reputation: 103
The snippet is a generic version of what I am trying to accomplish.
program main
   integer, save :: j, Nj, k, Nk, AllocateStatus
   !double precision :: array(100000)
   double precision, dimension(:), allocatable :: array
   INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS, &
           OMP_GET_THREAD_NUM
   Nj = 100000
   Nk = 100
   allocate(array(Nk), STAT=AllocateStatus)
   IF (AllocateStatus /= 0) STOP "*** Not enough memory ***"
   array = 0.0
!$OMP PARALLEL PRIVATE(NTHREADS, TID)
!$OMP DO
   do j = 1, Nj
      !print *, "id", OMP_GET_THREAD_NUM()
      !DO COMPUTATIONALLY INTENSIVE PART
      do k = 1, Nk
         array(k) = array(k) + 1
      enddo
   enddo
!$OMP END DO
!$OMP END PARALLEL
   print *, array
   stop
end
In the non-OpenMP version every element of array will be 100000. With OpenMP as used in the snippet, I get array elements of about 99000. I'm not clear on what change I need to make to get the OpenMP version to yield the same output as the serial version.
EDIT:
The operations done in the outer loop have no dependence on each other, but the output of these operations needs to be cumulatively tracked in a variable like array. So a temporary array for each thread, which can then be combined once the outer loop is complete, should work, but I don't know how to do the reduction. Does the code below make sense?
program main
   integer, save :: j, Nj, k, Nk, AllocateStatus
   !double precision :: array(100000)
   double precision, dimension(:), allocatable :: a_tmp, a_total
   INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS, &
           OMP_GET_THREAD_NUM
   Nj = 100000
   Nk = 100
   allocate(a_tmp(Nk), STAT=AllocateStatus)
   IF (AllocateStatus /= 0) STOP "*** Not enough memory ***"
   allocate(a_total(Nk), STAT=AllocateStatus)
   IF (AllocateStatus /= 0) STOP "*** Not enough memory ***"
   a_tmp = 0.0
   a_total = 0.0
!$OMP PARALLEL PRIVATE(NTHREADS, TID, a_tmp)
!$OMP DO
   do j = 1, Nj
      !print *, "id", OMP_GET_THREAD_NUM()
      do k = 1, Nk
         a_tmp(k) = a_tmp(k) + 1
      enddo
   enddo
!$OMP END DO
   a_total = a_total + a_tmp
!$OMP END PARALLEL
   print *, a_total
   stop
end
Upvotes: 1
Views: 384
Reputation: 78314
This is quite a generic answer.
Your loop is wonky. As you've written it, the values of j will be shared out across threads equally, so thread 1 gets j=1..Nj/num_threads, thread 2 gets j=(Nj/num_threads)+1..2*Nj/num_threads, and so on. But every thread will execute
do k=1, Nk
for all values 1..Nk. That means all threads will update array(k), and you have no control over when or how that happens. In particular, you can't prevent thread 1 reading a value, followed by thread 2 reading the same value, then adding 1 to it and writing it back, followed by thread 1 writing back its own value for the same element -- thereby missing one update to the value of the variable.
You have programmed a data-race. Nothing to be ashamed of, we've all done it.
How you remove that is another question, and it depends on exactly what you are trying to achieve. You could, for example, parallelise over k rather than j. But the simple code you've shown us is, perhaps, too simple. I'd say that it is relatively unusual to write nested loops in OpenMP in which the contents of the inner loop do not incorporate, in some way, the value of the outer loop iterator (j in your case), precisely so that they do not implement a data race.
For example
!$OMP DO
do j = 1, Nj
   do k = 1, Nk
      array(j) = array(j) + 1
   enddo
enddo
does not have a data race; the compiler/run-time will take care of distributing the loop iterations and the multiple writes won't occur. On the other hand, it expresses a different operation than you've programmed.
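The other option mentioned above, parallelising over k rather than j, keeps the original operation but makes each thread the sole writer of its own range of k. A minimal sketch (my own illustration, not the questioner's code; assumes a compiler with OpenMP enabled, e.g. gfortran -fopenmp):

```fortran
program k_parallel
   implicit none
   integer :: j, k
   integer, parameter :: Nj = 100000, Nk = 100
   double precision :: array(Nk)

   array = 0.0d0
   ! The parallel region spans the whole j loop; the worksharing
   ! !$omp do splits only the k iterations, so each array(k) is
   ! written by exactly one thread per j iteration -- no race.
!$omp parallel private(j)
   do j = 1, Nj
!$omp do
      do k = 1, Nk
         array(k) = array(k) + 1
      end do
!$omp end do
   end do
!$omp end parallel
   print *, array(1)   ! every element equals Nj
end program k_parallel
```

Note that every thread executes the (serial) j loop and the implicit barrier at !$omp end do keeps them in step; with only Nk=100 inner iterations the available parallelism is modest, but the updates are safe.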
Another approach would be to use the OpenMP collapse clause, which fuses the two loops into a single iteration space. Note, though, that for the code as written collapsing alone does not remove this particular race -- different values of j still update the same array(k) -- so you would need to combine it with a reduction. Either your tutorial material or other Qs and As here on SO will show you how to use it.
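On the reduction the EDIT asks about: in Fortran, OpenMP allows the reduction clause on whole arrays, which does automatically what the questioner's a_tmp/a_total attempt does by hand -- each thread gets a zero-initialised private copy, and the copies are summed when the loop finishes. A minimal sketch (my own illustration; assumes a compiler with OpenMP enabled, e.g. gfortran -fopenmp):

```fortran
program array_reduction
   implicit none
   integer :: j, k
   integer, parameter :: Nj = 100000, Nk = 100
   double precision :: array(Nk)

   array = 0.0d0
   ! reduction(+:array) gives every thread a zero-initialised
   ! private copy of array and adds the copies together at the
   ! end of the loop -- no explicit temporary, no critical section.
!$omp parallel do reduction(+:array)
   do j = 1, Nj
      do k = 1, Nk
         array(k) = array(k) + 1
      end do
   end do
!$omp end parallel do
   print *, array(1)   ! every element equals Nj
end program array_reduction
```

The hand-rolled version in the EDIT can also be made correct, but the line a_total = a_total + a_tmp would itself need protection (e.g. a critical section), since all threads update the shared a_total there; the reduction clause sidesteps that entirely.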
Upvotes: 2