Reputation: 868
I have a nested loop: (L and A are fully defined inputs)
#pragma omp parallel for schedule(guided) shared(L,A) \
    reduction(+:dummy)
for (i=k+1;i<row;i++){
    for (n=0;n<k;n++){
        #pragma omp atomic
        dummy += L[i][n]*L[k][n];
        L[i][k] = (A[i][k] - dummy)/L[k][k];
    }
    dummy = 0;
}
And its sequential version:
for (i=k+1;i<row;i++){
    for (n=0;n<k;n++){
        dummy += L[i][n]*L[k][n];
        L[i][k] = (A[i][k] - dummy)/L[k][k];
    }
    dummy = 0;
}
The two versions give different results, and the parallel version is much slower than the sequential one.
What may be causing the problem?
Edit:
To get rid of the problems caused by the atomic directive, I modified the code as follows:
#pragma omp parallel for schedule(guided) shared(L,A) \
    private(i)
for (i=k+1;i<row;i++){
    double dummyy = 0;
    for (n=0;n<k;n++){
        dummyy += L[i][n]*L[k][n];
        L[i][k] = (A[i][k] - dummyy)/L[k][k];
    }
}
But this didn't solve the problem either; the results are still different.
Upvotes: 2
Views: 1006
Reputation: 2318
The difference in results comes from the inner loop variable n, which is shared between the threads because it is defined outside of the omp pragma.
Clarified: the loop variable n should be declared inside the parallel loop so that each thread gets its own copy, for example for (int n = 0; ...).
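A minimal sketch of that fix applied to the edited version from the question (assuming k, row, L and A are defined in the enclosing scope, as stated):
#pragma omp parallel for schedule(guided) shared(L,A) private(i)
for (i=k+1;i<row;i++){
    double dummyy = 0;
    for (int n=0;n<k;n++){   /* n declared here, so each thread has its own copy */
        dummyy += L[i][n]*L[k][n];
        L[i][k] = (A[i][k] - dummyy)/L[k][k];
    }
}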
Upvotes: 1
Reputation: 78306
In your parallel version you've inserted an unnecessary (and possibly harmful) atomic directive. Once you've declared dummy to be a reduction variable, OpenMP takes care of stopping the threads from interfering with each other in the reduction. I think the main impact of the unnecessary directive is to slow your code down, a lot.
I see you have another answer addressing the wrongness of your results. But I notice that you seem to set dummy to 0 at the end of each outer loop iteration, which seems strange if you are trying to use it as some kind of accumulator, which is what the reduction clause suggests. Perhaps you want to reduce into dummy across the inner loop?
If you are having problems with reduction, read this.
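For illustration, a sketch of using the reduction that way, with dummy accumulated across the inner loop and no atomic (assuming k, row, L and A are defined as in the question):
for (i=k+1;i<row;i++){
    double dummy = 0;
    /* the reduction combines each thread's partial sum; no atomic is needed */
    #pragma omp parallel for reduction(+:dummy)
    for (int n=0;n<k;n++){
        dummy += L[i][n]*L[k][n];
    }
    L[i][k] = (A[i][k] - dummy)/L[k][k];
}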
Upvotes: 2
Reputation: 116246
I am not very familiar with OpenMP but it seems to me that your calculations are not order-independent. Namely, the result in the inner loop is written into L[i][k], where i and k are invariants for the inner loop. This means that the same value is overwritten k times during the inner loop, resulting in a race condition.
Moreover, dummy seems to be shared between the different threads, so there might be a race condition there too, unless your pragma parameters somehow prevent it.
Altogether, to me it looks like the calculations in the inner loop must be performed in the same sequential order, if you want the same result as given by the sequential execution. Thus only the outer loop can be parallelized.
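As a sketch of that last point (same assumptions as in the question: k, row, L and A defined in the enclosing scope), parallelizing only the outer loop, with a per-thread accumulator and a single write to L[i][k] per row, could look like this:
#pragma omp parallel for schedule(guided) shared(L,A)
for (int i=k+1;i<row;i++){
    double dummy = 0;                      /* private to each thread */
    for (int n=0;n<k;n++){
        dummy += L[i][n]*L[k][n];
    }
    L[i][k] = (A[i][k] - dummy)/L[k][k];   /* written once, after the inner sum */
}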
Upvotes: 2