Reputation: 136
I'm using OpenMP on C++ and I want to parallelize very simple loop. But I can't do it correctly. All time I get wrong result.
for(i=2;i<N;i++)
for(j=2;j<N;j++)
A[i,j] =A[i-2,j] +A[i,j-2];
Code:
int const N = 10;
int arr[N][N];
#pragma omp parallel for
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
arr[i][j] = 1;
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
{
arr[i][j] = arr[i-2][j] +arr[i][j-2];
}
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
printf_s("%d ",arr[i][j]);
printf("\n");
}
Do you have any suggestions how I can do it? Thank you!
Upvotes: 0
Views: 2107
Reputation: 1684
If you ask about parallelism in general, then one more possible answer is vectorization. You could achieve some relatively poor vector parallelizm (something like 2x speedup or so) without changing the data structure and codebase. This is possible using OpenMP4.0 or CilkPlus pragma simd or similar (with safelen/vectorlength(2))
Well, you really have data dependence (both inner and outer loops), but it belongs to «WAR»[ (write after read) dependencies sub-category, which is blocker for using «omp parallel for» «as is» but not necessarily a problem for «pragma omp simd» loops.
To make this working you will need x86 compilers supporting pragma simd either via OpenMP4 or via CilkPlus (very recent gcc or Intel compiler).
Upvotes: 0
Reputation: 4059
To parallelize the loop nest in the question is tricky, but doable. Lamport's paper "The Parallel Execution of DO Loops" covers the technique. Basically you have to rotate your (i,j) coordinates by 45 degrees into a new coordinate system (k,l), where k=i+j and l=i-j.
Though to actually get speedup, the iterations likely have to be grouped into tiles, which makes the code even uglier (four nested loops).
A completely different approach is to solve the problem recursively, using OpenMP tasking. The recursion is:
if( too small to be worth parallelizing ) {
do serially
} else {
// Recursively:
Do upper left quadrant
Do lower left and upper right quadrants in parallel
Do lower right quadrant
}
As a practical matter, the ratio of arithmetic operations to memory accesses is so low that it is going to be difficult to get speedup out of the example.
Upvotes: 3
Reputation: 78364
This
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
{
arr[i][j] = arr[i-2][j] +arr[i][j-2];
}
is always going to be a source of grief and unpredictable output. The OpenMP run time is going to hand each thread a range of values for i
and leave them to it. There will be no determinism in the relative order in which threads update arr
. For example, while thread 1 is updating elements with i = 2,3,4,5,...,100
(or whatever) and thread 2 is updating elements with i = 102,103,104,...,200
the program does not determine whether thread 1 updates arr[i,:] = 100
before or after thread 2 wants to use the updated values in arr
. You have written a code with a classic data race.
You have a number of options to fix this:
You could tie yourself in knots trying to ensure that the threads update arr
in the right (ie sequential) order. The end result would be an OpenMP program that runs more slowly than the sequential program. DO NOT TAKE THIS OPTION.
You could make 2 copies of arr
and always update from one to the other, then from the other to the one. Something like (very pseudo-code)
for ...
{
old = 0
new = 1
arr[i][j][new] = arr[i-2][j][old] +arr[i][j-2][old];
old = 1
new = 0
}
Of course, this second approach trades space for time but that's often a reasonable trade-off.
You may find that adding an extra plane to arr
doesn't immediately speed things up because it wrecks the spatial locality of values pulled into cache. Experiment a bit with this, possibly make [old]
the first index element rather than the last.
Since updating each element in the array depends on the values found in elements 2 rows/columns away you're effectively splitting the array up like a chess-board, into white and black elements. You could use 2 threads, one on each 'colour', without the threads racing for access to the same data. Again, though, the disruption of spatial locality in the cache might have a bad impact on speed.
If any other options occur to me I'll edit them in.
Upvotes: 3
Reputation: 8489
serial and parallel run will give different. result because in
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
{
arr[i][j] = arr[i-2][j] +arr[i][j-2];
}
.....
you update arr[i]. so you change data used by the other thread. it will lead to a read over write data race!
Upvotes: 3