Reputation: 1536
I know this might look like a duplicate, but since I am learning OpenMP for the first time and am still confused after going through multiple sources and posts, I decided to post a question myself.
I am learning OpenMP, and while learning more about loop parallelism, I got to know that "nested parallelism" is disabled in OpenMP - source:
The article uses this code as an example:
#pragma omp parallel for
for (int i = 0; i < 3; ++i) {
    #pragma omp parallel for
    for (int j = 0; j < 6; ++j) {
    }
}
According to the article, this does not run as two parallel loops, because the second pragma is ignored by OpenMP when it is reached.
Meanwhile, in a StackOverflow answer to a similar question, I read that this kind of parallelism does not work because all the available threads are already reserved by the outer loop.
I don't understand whether that second explanation is correct, because if it is, could we make it work by specifying the number of threads (so that some threads remain available for the inner loop)?
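From what I have gathered while reading, nested parallelism can apparently also be enabled explicitly, so I tried a small test like this (the thread counts are just values I picked to experiment with):
#include <omp.h>
#include <stdio.h>

int main(void) {
    // Allow two levels of active parallelism.
    // omp_set_nested(1) is the older alternative, deprecated in OpenMP 5.0.
    omp_set_max_active_levels(2);
    #pragma omp parallel for num_threads(3)
    for (int i = 0; i < 3; ++i) {
        #pragma omp parallel for num_threads(2)
        for (int j = 0; j < 6; ++j) {
            printf("i=%d j=%d outer thread=%d inner thread=%d\n",
                   i, j, omp_get_ancestor_thread_num(1), omp_get_thread_num());
        }
    }
    return 0;
}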
Regarding parallelization of nested for loops, I know we can use the collapse clause, which collapses two nested loops into a single iteration space for us, but how can we parallelize this kind of loop, which is not perfectly nested?
for (i = 0; i < N; i++) {
    y[i] = 0.;
    for (j = 0; j < N; j++)
        y[i] += A[i][j] * x[j];
}
The same source suggests that this loop can be rewritten as:
for (i = 0; i < N; i++)
    y[i] = 0.;
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++)
        y[i] += A[i][j] * x[j];
}
Since this is now a perfectly nested loop, can we use the collapse clause on it, or is there another workaround?
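For reference, this is how I understand collapse is used on a perfectly nested loop (a toy example I made up; a, b, and c are hypothetical arrays):
// Made-up example: collapse(2) flattens the 2D iteration space and
// splits it among threads. Safe here because each (i, j) pair
// writes a distinct element of c.
enum { N = 100, M = 200 };
double a[N][M], b[N][M], c[N][M];

#pragma omp parallel for collapse(2)
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        c[i][j] = a[i][j] + b[i][j];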
I am confused about how to parallelize nested for loops that are not perfectly nested, i.e.
for () {
    for () {
    }
}
Upvotes: 0
Views: 1033
Reputation: 50808
You can use tasks to solve this kind of problem. Indeed, since OpenMP 3.0, OpenMP has slowly moved away from the pure fork-join model to embrace tasks. Tasks still let users write fork-join-based parallel loops, but they also make it possible to parallelize much more complex cases. The OpenMP taskloop construct is very useful for mimicking the basic work-sharing loop construct with tasks (warning: one task per logical iteration may be created by default if you do not specify the number of tasks or their granularity). Note, however, that tasks often introduce a significant overhead compared to basic work-sharing loops (especially on many-core NUMA systems).
// taskloop must be encountered inside a parallel region;
// a single thread creates the tasks, all threads execute them.
#pragma omp parallel
#pragma omp single
#pragma omp taskloop
for (int i = 0; i < N; i++) {
    double sum = 0.; // Assume y[i] is of type double
    // Nested taskloop with a task reduction (requires OpenMP 5.0)
    #pragma omp taskloop reduction(+:sum) grainsize(32768)
    for (int j = 0; j < N; j++)
        sum += A[i][j] * x[j];
    y[i] = sum;
}
However, I see no benefit to this implementation: a naive parallel for on only the outer i-based loop should already provide enough (thread-based) parallelism, and exposing more parallelism often introduces a significant overhead (even with just a collapse(2) in such a case).
Note that specifying to the compiler that SIMD instructions can be used in the j-based loop, with omp simd reduction(+:sum), is probably a good idea (unrolling may help too).
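A minimal sketch of that simpler approach (plain parallel for on the outer loop plus SIMD on the inner one; the static schedule is just an assumption that suits this regular workload):
// Outer loop split across threads; inner dot product vectorized.
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
    double sum = 0.;
    #pragma omp simd reduction(+:sum)
    for (int j = 0; j < N; j++)
        sum += A[i][j] * x[j];
    y[i] = sum;
}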
Note that you could use an array-based reduction to solve this specific case, but such reductions have only been supported quite recently, so your compiler may not support them. Moreover, it is probably not a good idea if y is big (as the array will be replicated for each thread).
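For completeness, here is a sketch of what such an array-based reduction could look like (array-section reductions require a compiler with OpenMP 4.5+ support):
// y must be zeroed first; each thread then gets a private copy of
// the whole y[0:N] section, which is combined back at the end.
for (int i = 0; i < N; i++)
    y[i] = 0.;
#pragma omp parallel for collapse(2) reduction(+:y[0:N])
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        y[i] += A[i][j] * x[j];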
Upvotes: 1