HARUN SASMAZ

Reputation: 609

Parallelization of true dependencies

Is it worth parallelizing loops with true dependencies? What are the pros and cons? How much speedup can we get on average?

For example:

    int i, j;
    int sum = 0;
    for (i = 0; i < 2000-1; i++) {
        for (j = 0; j < 2000; j++) {
            curr[i][j] = some_value_here;
            sum += curr[i][j];
        }
    }

How should I approach this loop? There is an obvious RAW dependency; should I parallelize it? If so, how?

Upvotes: 0

Views: 298

Answers (1)

Hristo Iliev

Reputation: 74465

sum acts as a simple accumulator and this whole operation is a parallel reduction. The proper solution is to have each thread accumulate its own private sum and then add all private sums together at the end. OpenMP provides the reduction clause that does exactly that:

int i, j;
int sum = 0;
#pragma omp parallel for collapse(2) reduction(+:sum)
for (i = 0; i < 2000-1; i++) {
    for (j = 0; j < 2000; j++) {
        curr[i][j] = some_value_here;
        sum += curr[i][j];
    }
}

reduction(+:sum) tells the compiler to create a private copy of sum in each thread, combine those private copies with the + operator at the end of the region, and add the result to the value sum had before the region. The code is roughly equivalent to:

int i, j;
int sum = 0;
#pragma omp parallel
{
   int localsum = 0;
   #pragma omp for collapse(2)
   for (i = 0; i < 2000-1; i++) {
      for (j = 0; j < 2000; j++) {
         curr[i][j] = some_value_here;
         localsum += curr[i][j];
      }
   }
   #pragma omp atomic
   sum += localsum;
}

The potential speedup here equals the number of execution units, provided there is one thread per execution unit and the thread count is small enough that the summation at the end of the parallel region takes negligible time.

Upvotes: 2
