Reputation: 609
Is it worth parallelizing loops with a true dependency? What are the pros and cons, and how much speedup can one get on average?
For example:
int sum = 0;
for (int i = 0; i < 2000 - 1; i++) {
    for (int j = 0; j < 2000; j++) {
        curr[i][j] = some_value_here;
        sum += curr[i][j];
    }
}
How should I approach this loop? There is an obvious RAW (read-after-write) dependency on sum. Should I parallelize it, and if so, how?
Upvotes: 0
Views: 298
Reputation: 74465
sum acts as a simple accumulator here, and the whole operation is a parallel reduction. The proper solution is to have each thread accumulate its own private sum and then add all the private sums together at the end. OpenMP provides the reduction clause, which does exactly that:
int sum = 0;
#pragma omp parallel for collapse(2) reduction(+:sum)
for (int i = 0; i < 2000 - 1; i++) {
    for (int j = 0; j < 2000; j++) {
        curr[i][j] = some_value_here;
        sum += curr[i][j];
    }
}
reduction(+:sum) tells the compiler to create a private copy of sum in each thread and then apply the + operator to reduce those private copies to a single value, which is finally added to the value sum had before the parallel region. The code is roughly equivalent to:
int sum = 0;
#pragma omp parallel
{
    int localsum = 0;
    #pragma omp for collapse(2)
    for (int i = 0; i < 2000 - 1; i++) {
        for (int j = 0; j < 2000; j++) {
            curr[i][j] = some_value_here;
            localsum += curr[i][j];
        }
    }
    #pragma omp atomic
    sum += localsum;
}
The potential speedup here is bounded by the number of execution units, provided that there is one thread per execution unit and that the number of threads is small enough for the synchronised summation at the end of the parallel region to take negligible time.
Upvotes: 2