Shiyu

Reputation: 110

OpenMP not working with indirect array access

I'm trying to optimize a loop by parallelizing it with OpenMP.

  DOUBLE PRECISION, INTENT(INOUT) :: X(13541)
  INTEGER         , INTENT(IN)    :: NELEM,NELMAX !NELMAX = 25996 
  INTEGER         , INTENT(IN)    :: IKLE(NELMAX)
  DOUBLE PRECISION, INTENT(IN)    :: W(NELMAX)
  INTEGER                         :: IELEM   ! loop index
!$OMP PARALLEL DO PRIVATE(IELEM) REDUCTION(+:X)
  DO IELEM = 1 , NELEM
    X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
  ENDDO
!$OMP END PARALLEL DO

For distinct I and J, it is possible that IKLE(I) = IKLE(J).

Increasing the number of threads, I found that the loop takes even longer to run. I use OMP_GET_WTIME() to time the job.

Threads   Time (s)
1         0.21943306922912598
2         0.30610203742980957
3         0.43688893318176270
4         0.53783702850341797
5         0.61055016517639160
6         0.96715998649597168
7         0.89582014083862305
8         1.3073339462280273

I think the problem is caused by the indirect array access, but I don't know how to deal with it in OpenMP.

Upvotes: 2

Views: 171

Answers (1)

High Performance Mark

Reputation: 78334

If it's possible that IKLE(I)=IKLE(J) then you have a worse problem than irregular access trashing any hope of efficient use of cached data: you have a data race. There is no protection in your code against multiple threads writing to the same location 'simultaneously'. Depending on the frequency of occurrence of IKLE(I)=IKLE(J) you may get lucky and never experience the race in practice. Or you may get unlucky. Either way, as written, the program is a wrong 'un.
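For what it's worth, a quick way to gauge how often that happens is to histogram the index array. A minimal sketch, assuming (as the declarations suggest) every entry of IKLE lies in 1..13541, the declared size of X:

  INTEGER :: COUNTS(13541), IELEM   ! scratch histogram, not part of the original code
  COUNTS = 0
  DO IELEM = 1 , NELEM
    COUNTS(IKLE(IELEM)) = COUNTS(IKLE(IELEM)) + 1
  ENDDO
  PRINT *, 'elements of X written by more than one iteration:', COUNT(COUNTS > 1)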

FWIW I agree with your suspicion that the irregular pattern of access to the elements of X is the root of the timing peculiarity you have reported; it requires a lot more movement of data through caches.

Also, while I'm writing, this line

  INTEGER, INTENT(IN) :: NELEM,NELMAX = 25996

is wrong too; it's a syntax error to try to initialise a routine argument.
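For illustration only (these lines are not from the question), the usual alternatives are either to let the caller supply the value through the dummy argument, or, if the size really is fixed, to declare it as a named constant rather than an argument:

  ! Either take the value from the caller through the dummy argument ...
  INTEGER, INTENT(IN) :: NELEM, NELMAX
  ! ... or (instead, not as well) make it a named constant:
  ! INTEGER, PARAMETER :: NELMAX = 25996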

EDIT, in response to OP's comment:

"The dependence is surely another problem" -- not to my way of thinking: the dependence (by which I take it you mean what I call the data race) makes the program wrong, broken, erroneous. The poor scaling is just an inconvenience. The problem arises in the OpenMP version of the program because two or more threads may try to update the same element of X at about the same time. The update is not atomic; the various operations that go on behind the scenes (read the data into a register, add the values in two registers together, write the data to a memory location, that sort of thing) may be interleaved in any way, and most of those interleavings will not leave the value in X as it would have been in a sequential execution of the program.
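For illustration only, one way to protect each update individually is the OpenMP ATOMIC construct; a minimal sketch, with the REDUCTION clause dropped, would look something like this. Bear in mind that atomic updates to heavily contended elements of X can easily end up slower than the sequential loop, so this addresses the correctness, not the scaling:

!$OMP PARALLEL DO PRIVATE(IELEM)
  DO IELEM = 1 , NELEM
!$OMP ATOMIC UPDATE
    X(IKLE(IELEM)) = X(IKLE(IELEM)) + W(IELEM)
  ENDDO
!$OMP END PARALLEL DO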

"Would you give some hints about the movement of data you said?" Not at this time; I've written enough.

Upvotes: 1
