schuberm

Reputation: 103

Loop through array in inner loop with OpenMP

The snippet is a generic version of what I am trying to accomplish.

program main
integer, save :: j, Nj, k, Nk, AllocateStatus
!double precision :: array(100000)
double precision, dimension(:), allocatable :: array

integer :: NTHREADS, TID, OMP_GET_NUM_THREADS, &
           OMP_GET_THREAD_NUM

Nj = 100000
Nk = 100
allocate(array(Nk), STAT=AllocateStatus)
if (AllocateStatus /= 0) stop "*** Not enough memory ***"

array = 0.0
!$OMP PARALLEL PRIVATE(NTHREADS, TID)
!$OMP DO
do j = 1, Nj

    !print *, "id", OMP_GET_THREAD_NUM()
    !DO COMPUTATIONALLY INTENSIVE PART

    do k = 1, Nk
        array(k) = array(k) + 1
    enddo

enddo
!$OMP END DO
!$OMP END PARALLEL

print *, array
stop
end

In the non-OpenMP version every element of array will be 100000. With OpenMP as used in the snippet, I get array elements of about 99000. I'm not clear on what change I need to make for the OpenMP version to yield the same output as the serial version.

EDIT:

The operations done in the outer loop have no dependence on each other, but their output needs to be accumulated in a variable like array. So a temporary array for each thread, combined once the outer loop is complete, should work, but I don't know how to do the reduction. Does the code below make sense?

program main
integer, save :: j, Nj, k, Nk, AllocateStatus
!double precision :: array(100000)
double precision, dimension(:), allocatable :: a_tmp, a_total

integer :: NTHREADS, TID, OMP_GET_NUM_THREADS, &
           OMP_GET_THREAD_NUM

Nj = 100000
Nk = 100
allocate(a_tmp(Nk), STAT=AllocateStatus)
if (AllocateStatus /= 0) stop "*** Not enough memory ***"

allocate(a_total(Nk), STAT=AllocateStatus)
if (AllocateStatus /= 0) stop "*** Not enough memory ***"

a_tmp = 0.0
a_total = 0.0
!$OMP PARALLEL PRIVATE(NTHREADS, TID, a_tmp)
!$OMP DO
do j = 1, Nj

    !print *, "id", OMP_GET_THREAD_NUM()

    do k = 1, Nk
        a_tmp(k) = a_tmp(k) + 1
    enddo

enddo
!$OMP END DO
a_total = a_total + a_tmp
!$OMP END PARALLEL

print *, a_total
stop
end

Upvotes: 1

Views: 384

Answers (1)

High Performance Mark

Reputation: 78314

This is quite a generic answer.

Your loop is wonky. As you've written it, the values of j will be shared out across the threads roughly equally, so thread 1 gets j=1..Nj/num_threads, thread 2 gets j=(Nj/num_threads)+1..2*Nj/num_threads and so on. But every thread will execute

do k=1, Nk

for all values 1..Nk, which means that all threads will update array(k) and you have no control over when or how that happens. In particular, you can't prevent thread 1 reading a value, followed by thread 2 reading the same value, adding 1 to it and writing it back, followed by thread 1 writing back its own value for the same element -- thereby losing one update to the value of the variable.

You have programmed a data-race. Nothing to be ashamed of, we've all done it.

How you remove that is another question, and it depends on exactly what you are trying to achieve. You could, for example, parallelise over k rather than j. But the simple code you've shown us is, perhaps, too simple. It is relatively unusual to write nested loops in OpenMP in which the body of the inner loop does not incorporate, in some way, the value of the outer loop iterator (j in your case); incorporating it is usually what keeps the iterations independent and the code free of data races.
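Parallelising over k can be sketched on the snippet from the question itself: make k the index of the parallel loop, so that each thread owns a distinct set of array elements and no element is ever written by two threads. This sketch assumes, as the question states, that the iterations are independent of one another, so the two loops can be interchanged:

    !$OMP PARALLEL
    !$OMP DO
    do k = 1, Nk          ! each k, and hence array(k), belongs to one thread
        do j = 1, Nj
            array(k) = array(k) + 1
        enddo
    enddo
    !$OMP END DO
    !$OMP END PARALLEL

Bear in mind that with Nk = 100 there are only 100 iterations to share out, so this exposes less parallelism than distributing the Nj = 100000 outer iterations would.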

For example

!$OMP DO
    do j=1, Nj
        do k=1, Nk
            array(j)=array(j)+1
        enddo
    enddo

does not have a data race; the compiler/run-time takes care of distributing the loop iterations, so no two threads ever write to the same element. On the other hand, it expresses a different operation from the one you've programmed.

Another approach would be to use the OpenMP collapse clause, which fuses the two loops into a single iteration space; combined with a suitable reduction it eliminates the data race. Either your tutorial material or other Qs and As here on SO will show you how to use it.
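As a rough sketch of that last idea, applied to the loop from the question: collapse on its own only fuses the loops, so it is paired here with an array reduction, which Fortran's reduction clause permits (reductions on allocatable arrays need OpenMP 3.0 or later). Each thread then accumulates into a private copy of array, and the copies are summed into the shared array when the construct ends:

    ! COLLAPSE(2) fuses the j and k loops into one Nj*Nk iteration
    ! space; REDUCTION(+:array) gives every thread a private,
    ! zero-initialised copy of array, so concurrent updates to
    ! array(k) cannot be lost.
    !$OMP PARALLEL DO COLLAPSE(2) REDUCTION(+:array)
    do j = 1, Nj
        do k = 1, Nk
            array(k) = array(k) + 1
        enddo
    enddo
    !$OMP END PARALLEL DO

This is essentially the per-thread temporary-array idea from the question's EDIT, except that the run-time manages the temporaries and their final combination, including the synchronisation that a hand-written a_total = a_total + a_tmp inside the parallel region would otherwise need.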

Upvotes: 2
