Reputation: 173
I'm trying to manage nested parallel regions with OpenMP (4.5, via GCC 7.2.0) and I'm having some issues turning off nesting.
Sample program:
#include <stdio.h>
#include <omp.h>

void foobar() {
    int tid = omp_get_thread_num();
    #pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        int otid = omp_get_thread_num();
        printf("%d | %d\n", tid, otid);
    }
}

int main(void) {
    omp_set_nested(0);
    #pragma omp parallel
    {
        foobar();
    }
    printf("\n");
    foobar();
    return 0;
}
What I'm expecting to happen here is both the parallel region and non-parallel call on foobar() will spit out 4 lines, something to the tune of
// parallel region foobar()
0 | 0
1 | 1
2 | 2
3 | 3
// serial region foobar()
0 | 0
0 | 1
0 | 2
0 | 3
as I am not allowing nested parallelism. However, I get 16 lines within the parallel region with the correct TID, but the OTID is always 0 (i.e. every thread seems to be spawning its own team and executing the entire loop on that), and I get 4 lines outside (i.e. the parallel for is spawning 4 threads, as I would expect).
I feel like I'm missing something very obvious here, can anybody shed some light for me? Isn't disabling nesting supposed to turn that omp parallel for into a regular omp for, and distribute the work accordingly?
Upvotes: 2
Views: 294
Reputation: 9499
Your issue comes from the false assumption that the omp for directive will be interpreted, and the corresponding work distributed among the threads, irrespective of which parallel region is active. Unfortunately, in your code, the omp for is only associated with the parallel region that is declared in function foobar(). Therefore, when this region is active (meaning, since you disabled nested parallelism, when foobar() isn't called from another parallel region), your loop is distributed among the newly spawned threads. But when it isn't, because foobar() is called from another parallel region, the inner region runs with a team of just one thread, so the omp for has nobody to share the loop with and the iterations aren't distributed among the calling threads. Each and every one of them therefore executes the whole loop, which leads to the replication of printf() that you see (and explains why the OTID is always 0).
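To see this for yourself, a minimal sketch along the following lines reports the size of the team that an inner parallel region actually gets. The helper names inner_team_size() and nested_inner_team_size() are made up for illustration, and I use omp_set_max_active_levels(1) instead of omp_set_nested(0) because the latter is deprecated as of OpenMP 5.0. The serial fallbacks under #else are only there so the snippet still builds without -fopenmp.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (assumption: only needed when built without -fopenmp). */
static int omp_get_num_threads(void) { return 1; }
static void omp_set_max_active_levels(int levels) { (void)levels; }
#endif

/* Hypothetical helper: size of the team created by a parallel region opened here. */
int inner_team_size(void) {
    int size = 0;
    #pragma omp parallel
    {
        /* One thread of the new team records how large that team is. */
        #pragma omp single
        size = omp_get_num_threads();
    }
    return size;
}

/* Hypothetical helper: the same measurement, taken from inside an outer region. */
int nested_inner_team_size(void) {
    int size = 0;
    omp_set_max_active_levels(1);   /* at most one active level, i.e. nesting off */
    #pragma omp parallel num_threads(2)
    {
        int mine = inner_team_size();   /* each outer thread opens its own inner region */
        #pragma omp single
        size = mine;
    }
    return size;
}
```

Called from main(), nested_inner_team_size() should report 1, whereas inner_team_size() called from serial code reports whatever team size the runtime picks. A team of one is exactly why every calling thread ends up running all four iterations on its own.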
A possible solution would be something like this:
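#include <stdio.h>
#include <omp.h>

void bar(int tid) {
    // Orphaned worksharing loop: it binds to the innermost enclosing
    // parallel region at run time, wherever that region was opened.
    #pragma omp for
    for (int i = 0; i < 4; i++) {
        int otid = omp_get_thread_num();
        printf("%d | %d\n", tid, otid);
    }
}

void foobar() {
    int tid = omp_get_thread_num();
    int in_parallel = omp_in_parallel();
    if (!in_parallel) {
        // Called from serial code: open a parallel region ourselves.
        #pragma omp parallel
        bar(tid);
    }
    else {
        // Already inside a parallel region: reuse the caller's team.
        bar(tid);
    }
}

int main() {
    #pragma omp parallel
    foobar();
    printf("\n");
    foobar();
    return 0;
}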
I don't find this solution entirely satisfying, but I don't see a better one right now. Maybe I'll get some enlightenment later...
EDIT: well, I had another idea: doing it the other way around and forcing nested parallelism, with only a single active thread whenever the function is called from an actual parallel region:
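#include <stdio.h>
#include <omp.h>

void foobar() {
    int tid = omp_get_thread_num();
    // Allow the inner parallel region to spawn its own team.
    omp_set_nested(1);
    // Only one thread of the calling team opens the inner region;
    // from serial code the single is executed by the lone calling thread.
    #pragma omp single
    #pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        int otid = omp_get_thread_num();
        printf("%d | %d\n", tid, otid);
    }
}

int main() {
    #pragma omp parallel
    foobar();
    printf("\n");
    foobar();
    return 0;
}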
And this time the code looks much nicer without any duplication, and gives (for example):
$ OMP_NUM_THREADS=4 ./nested
3 | 2
3 | 3
3 | 1
3 | 0
0 | 3
0 | 1
0 | 0
0 | 2
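If you want to convince yourself that the loop body now runs exactly once per iteration from either call site, a rough check is to count executions instead of printing. The names executed, foobar_counted(), run_from_parallel() and run_from_serial() are hypothetical, and I left out the omp_set_nested(1) call on purpose: it only changes how many threads share the loop, not how many times the body runs.

```c
/* Shared counter, for illustration only (hypothetical name). */
static int executed = 0;

/* Same structure as foobar() above, but counting instead of printing. */
void foobar_counted(void) {
    /* One thread of the calling team opens the inner region. */
    #pragma omp single
    #pragma omp parallel for
    for (int i = 0; i < 4; i++) {
        /* Atomic update, since several threads may share the loop. */
        #pragma omp atomic
        executed++;
    }
}

/* Call site 1: from inside an existing parallel region. */
int run_from_parallel(void) {
    executed = 0;
    #pragma omp parallel
    foobar_counted();
    return executed;
}

/* Call site 2: from plain serial code. */
int run_from_serial(void) {
    executed = 0;
    foobar_counted();
    return executed;
}
```

Both run_from_parallel() and run_from_serial() should return 4. Dropping the omp single line is what brings back the duplicated work from the question, with every thread of the outer team running the whole loop.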
Upvotes: 3