user3166083

Reputation: 161

Threads summing a variable giving wrong answer in OpenMP

To practice parallelizing the do loop, I am doing the following integral in Fortran

$\int_{0}^{1} \frac{4}{1+x^{2}}\,dx = \pi$
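
(Analytically, the antiderivative of $\frac{4}{1+x^{2}}$ is $4\arctan x$, so the integral evaluates to $4\arctan 1 - 4\arctan 0 = \pi$.)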

The following is the code that I implemented:

      program mpintegrate

      integer i,nmax,nthreads,OMP_GET_NUM_THREADS
      real xn,dx,value
      real X(100000)

      nthreads = 4
      nmax = 100000
      xn = 0.0
      dx = 1.0/nmax
      value = 0.0

      do i=1,nmax
         X(i) = xn
         xn = xn + dx
      enddo

      call OMP_SET_NUM_THREADS(nthreads)

!$OMP Parallel 

!$OMP Do Schedule(Static) Private(i,X)

      do i=1,nmax
         value = value + dx*(4.0/(1+X(i)*X(i)))
      enddo

!$OMP End DO NoWait

!$OMP End Parallel

      print *, value

      end

I have no problems compiling the program

gfortran -fopenmp -o mpintegrate mpintegrate.f

The problem is when I execute the program. When I run the program as is, I get values ranging from 1 to 4. However, when I uncomment the print statement within the omp do loop, the final value is around what it should be, pi.

Why is the answer in value incorrect?

Upvotes: 2

Views: 1070

Answers (1)

Jonathan Dursi

Reputation: 50927

One problem here is that X should not be private (and a private X would need to be specified on the parallel line, not the do line); everyone needs to see it, and there's no point in having separate copies for each thread. Worse, the results you get from accessing the private copy here are undefined, as that private variable hasn't been initialized once you enter the parallel region. You could use firstprivate rather than private, which initializes each copy for you with what was there before the parallel region, but the easiest and best choice here is just shared.
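
For reference, the firstprivate variant mentioned above would look something like this - each thread gets its own copy of X, initialized from the original, which works but wastes memory compared to shared (and the sum on value still has the race discussed next):

!$OMP Parallel FirstPrivate(X)
!$OMP Do Schedule(Static) Private(i)
      do i=1,nmax
         value = value + dx*(4.0/(1+X(i)*X(i)))
      enddo
!$OMP End Do
!$OMP End Parallel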

There's also not much point in having the end do be nowait, as the end parallel has to wait for everyone to be done anyway.

However, that being said, you still have a pretty major (and classic) correctness problem. What's happening here is clearer if you're a little more explicit in the loop (dropping the schedule for clarity since the issue doesn't depend on the schedule chosen):

!$OMP Parallel do Private(i) Default(none) Shared(value,X,dx,nmax)

      do i=1,nmax
         value = value + dx*(4.0/(1+X(i)*X(i)))
      enddo

!$OMP End Parallel Do
      print *, value

Running this repeatedly gives different values:

$ ./foo
   1.6643878
$ ./foo
   1.5004054
$ ./foo
   1.2746993

The problem is that all of the threads are writing to the same shared variable value. This is wrong - everyone is writing at once and the result is gibberish, as a thread can calculate its own contribution, get ready to add it to value, and just as it's about to, another thread can do its write to value, which then gets promptly clobbered. Concurrent writes to the same shared variable are a classic race condition, a standard family of bugs that comes up particularly often in shared-memory programming like OpenMP.
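
To make the race concrete: each update is really a load, an add, and a store. An interleaving like the following (a purely illustrative sketch, with made-up numbers and tmpA/tmpB standing in for per-thread registers) silently loses one thread's contribution:

!     Thread A                            Thread B
!     load value into tmpA (reads 2.0)
!                                         load value into tmpB (also reads 2.0)
!     tmpA = tmpA + 0.1
!                                         tmpB = tmpB + 0.2
!     store tmpA into value (now 2.1)
!                                         store tmpB into value (now 2.2; A's 0.1 is lost)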

In addition to being wrong, it's slow. A number of threads contending for the same few bytes of memory - memory close enough together to fall in the same cache line - can be very slow because of contention in the memory system. Even if they aren't exactly the same variable (as they are in this case), this memory contention - False Sharing in the case that they only happen to be neighbouring variables - can significantly slow things down. Taking out the explicit thread-number setting, and using environment variables:

$ export OMP_NUM_THREADS=1
$ time ./foo
   3.1407621

real    0m0.003s
user    0m0.001s
sys 0m0.001s

$ export OMP_NUM_THREADS=2
$ time ./foo
   3.1224852

real    0m0.007s
user    0m0.012s
sys 0m0.000s

$ export OMP_NUM_THREADS=8
$ time ./foo
   1.1651508

real    0m0.008s
user    0m0.042s
sys 0m0.000s

So things get almost 3 times slower (and increasingly wrong) running with more threads.

So what can we do to fix this? One thing we could do is make sure that everyone's additions aren't overwriting each other, with the atomic directive:

!$OMP Parallel do Schedule(Static) Private(i) Default(none) Shared(X,dx, value, nmax)
      do i=1,nmax
!$OMP atomic
         value = value + dx*(4.0/(1+X(i)*X(i)))
      enddo
!$OMP end parallel do

which solves the correctness problem:

$ export OMP_NUM_THREADS=8
$ ./foo
   3.1407621

but does nothing for the speed problem:

$ export OMP_NUM_THREADS=1
$ time ./foo
   3.1407621

real    0m0.004s
user    0m0.001s
sys 0m0.002s

$ export OMP_NUM_THREADS=2
$ time ./foo
   3.1407738

real    0m0.014s
user    0m0.023s
sys 0m0.001s

(Note you get slightly different answers with different numbers of threads. This is due to the final sum being calculated in a different order than in the serial case. With single precision reals, differences showing up in the 7th digit due to different ordering of operations is hard to avoid, and here we're doing 100,000 operations.)
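
If you want to see that ordering effect in isolation, a tiny standalone snippet (hypothetical, not part of the program above) that sums the same single-precision terms forwards and backwards will usually already disagree in the last digits:

      program sumorder
      integer :: i
      real :: fwd, bwd
      fwd = 0.0
      bwd = 0.0
!     same terms, accumulated in opposite orders
      do i = 1, 100000
         fwd = fwd + 1.0/real(i)
      enddo
      do i = 100000, 1, -1
         bwd = bwd + 1.0/real(i)
      enddo
      print *, fwd, bwd
      end program sumorder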

So what else could we do? One approach is for everyone to keep track of their own partial sums, and then sum them all together when we're done:

!... 

      integer, parameter :: nthreads = 4
      integer, parameter :: space=8
      integer :: threadno
      real, dimension(nthreads*space) :: partials
!...

      partials=0

!...

!$OMP Parallel Private(value,i,threadno) Default(none) Shared(X,dx, partials)
      value = 0
      threadno = omp_get_thread_num()
!$OMP DO
      do i=1,nmax
         value = value + dx*(4.0/(1+X(i)*X(i)))
      enddo
!$OMP END DO
      partials((threadno+1)*space) = value
!$OMP end parallel

      value = sum(partials)
      print *, value
      end

This works - we get the right answer, and if you play with the number of threads, you'll find it's pretty zippy - we've spaced out the entries in the partial sums array to avoid false sharing (and the sharing really would only be false this time, since everyone writes to a different entry in the array - there's no overwriting).

Still, this is a silly amount of work just to get a sum correct across threads! There's a simpler way to do this - OpenMP has a reduction clause to do this automatically (and more efficiently than the handmade version above):

!$OMP Parallel do reduction(+:value) Private(i) Default(none) Shared(X,dx)
      do i=1,nmax
         value = value + dx*(4.0/(1+X(i)*X(i)))
      enddo
!$OMP end parallel do
      print *, value 

and now the program works correctly, is fast, and the code is fairly simple. The final code, in more modern Fortran, looks something like this:

      program mpintegrate
      use omp_lib
      integer, parameter :: nmax = 100000
      real :: xn,dx,value
      real :: X(nmax)
      integer :: i

      integer, parameter :: nthreads = 4

      xn = 0.0
      dx = 1.0/nmax
      value = 0.0

      do i=1,nmax
         X(i) = xn
         xn = xn + dx
      enddo

      call omp_set_num_threads(nthreads)

!$OMP Parallel do reduction(+:value) Private(i) Default(none) Shared(X,dx)
      do i=1,nmax
         value = value + dx*(4.0/(1+X(i)*X(i)))
      enddo
!$OMP end parallel do
      print *, value

      end
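
To try it out, compile and run it the same way as the original (note that the omp_set_num_threads call in the code overrides the OMP_NUM_THREADS environment variable, so this version always runs with 4 threads); it should print a value close to pi, with the last digits varying slightly with the number of threads, as noted above:

$ gfortran -fopenmp -o mpintegrate mpintegrate.f
$ ./mpintegrate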

Upvotes: 5
