How to remove Fortran race condition?

Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.

The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.

As I understand it, a race condition is occurring here because all the loops are trying to access the same v(i,j) values all the time, slowing down the code.

Would the best fix here be to create as many copies of the v() array as threads and have each thread access a different one?

I'm using intel compiler on 16 cores, and it runs just slightly slower than on a single core. Thanks all!

!$OMP PARALLEL DO

Do 500, k=1,n

Do 10, i=-(b-1),b-1
 Do 20, j=-(b-1),b-1
 if (abs(i).le.l.and.abs(j).eq.d) then
    cycle
 endif
 v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))

 if (k.eq.n-1) then
    vtest(i,j,1)=v(i,j)
endif
if (k.eq.n) then
    vtest(i,j,2)=v(i,j)
endif
 20 continue
10 continue

500 continue

!$OMP END PARALLEL DO

Upvotes: 3

Answers (1)

High Performance Mark

Reputation: 78316

You certainly have programmed a race condition though I'm not sure that that is the cause of your program's failure to execute more quickly. This line

v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))

which will be executed by all threads for the same (set of) values for i and j is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v your program is, in practice, not deterministic as there is no way to know the order in which updates to v are made.

You should have observed this non-determinism on inspecting the results of the program, and have noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array the results may have converged to the same (or similar enough) values.

OpenMP gives you the tools to coordinate access to variables but it doesn't automatically implement them; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on cache at some level in your system's memory hierarchy. A nice, cache-friendly, run over every element of an array in memory order for a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory requiring access to RAM at every go.

It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is slightly longer than the time to execute a serial version I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.

How to fix this ?

Well the usual pattern of OpenMP across an array is to parallelise on one of the array indices. The statements

!$omp parallel do
do i=-(b-1),b-1
....
end do

ensure that each thread gets a different set of values for i which means that they write to different elements of v, removing (almost) the data race. As you've written the program each thread gets a different set of values of k but that's not used (much) in the inner loops.

In passing, testing

if (k==n-1) then

and

if (k==n) then

in every iteration looks like you are tying an anchor to your program, why not just

do k=1,n-2

and deal with the updates to vtest at the end of the loop.

You could separate the !$omp parallel do like this

!$omp parallel
do k=1,n-2

    !$omp do
    do i=-(b-1),b-1

(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (eg private or shared) of each variable; but this answer is getting a bit too long and I won't go into more detail on these. Or on using a schedule clause.

Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement

v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))

will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.

Upvotes: 4

How to remove Fortran race condition?

Answers (1)

Related Questions