Reputation: 1109
Suppose I have the following function, which makes use of #pragma omp parallel internally:
void do_heavy_work(double * input_array);
I now want to call do_heavy_work on many input arrays, thus:
void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
    for (int i = 0; i < num_arrays; ++i)
    {
        do_heavy_work(input_arrays[i]);
    }
}
Let's say I have N hardware threads. The implementation above would cause num_arrays invocations of do_heavy_work to occur serially, each using all N threads internally to do whatever parallel work it wants.

Now assume that when num_arrays > 1 it is actually more efficient to parallelise over this outer loop than it is to parallelise internally in do_heavy_work. I now have the following options.
Put #pragma omp parallel for on the outer loop and set OMP_NESTED=1. However, with OMP_NUM_THREADS=N this will end up spawning a large total number of threads (N*num_arrays), even when num_arrays < N. Ideally I want OpenMP to split its team of OMP_NUM_THREADS threads into num_arrays subteams, and each invocation of do_heavy_work would then parallelise over whatever subteam it was allocated.
What's the easiest way to achieve this?
(For the purpose of this discussion let's assume that num_arrays is not necessarily known beforehand, and that I cannot change the code in do_heavy_work itself. The code should work on a range of machines, so N should be freely specifiable.)
Upvotes: 3
Views: 2431
Reputation: 74365
OMP_NUM_THREADS can be set to a list, thus specifying the number of threads at each level of nesting. E.g. OMP_NUM_THREADS=10,4 will tell the OpenMP runtime to execute the outer parallel region with 10 threads, and each nested region will execute with 4 threads, for a total of up to 40 simultaneously running threads.
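Note that nested parallel regions run single-threaded unless nested parallelism is enabled. A minimal sketch of the environment setup (./my_program is a hypothetical executable standing in for your application):

```shell
# Enable nested parallelism. OMP_NESTED is deprecated as of OpenMP 5.0
# in favour of OMP_MAX_ACTIVE_LEVELS; setting both is harmless.
export OMP_NESTED=true
export OMP_MAX_ACTIVE_LEVELS=2

# 10 threads in the outer parallel region, 4 in each nested region
export OMP_NUM_THREADS=10,4
./my_program
```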
Alternatively, you can make your program adaptive with code similar to this one:
void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
    #pragma omp parallel num_threads(num_arrays)
    {
        int nested_team_size = omp_get_max_threads() / num_arrays;
        omp_set_num_threads(nested_team_size);

        #pragma omp for
        for (int i = 0; i < num_arrays; ++i)
        {
            do_heavy_work(input_arrays[i]);
        }
    }
}
This code will not use all available threads if the value of OMP_NUM_THREADS is not divisible by num_arrays. If having a different number of threads per nested region is fine (it could result in some arrays being processed faster than others), come up with a scheme for distributing the threads and set nested_team_size in each thread accordingly. Calling omp_set_num_threads() from within a parallel region only affects nested regions started by the calling thread, so different threads can have different nested team sizes.
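To make the uneven distribution concrete, here is one possible scheme as a small pure helper (subteam_size is a hypothetical name, not part of OpenMP): every outer thread gets the base share, and the first total % num_arrays threads take one extra.

```c
#include <assert.h>

/* Hypothetical helper: nested team size for outer thread `tid` when
   `total` threads are split across `num_arrays` subteams as evenly as
   possible. The first (total % num_arrays) subteams get one extra thread,
   so the subteam sizes always sum to `total`. */
static int subteam_size(int total, int num_arrays, int tid)
{
    int base  = total / num_arrays;
    int extra = total % num_arrays;
    return base + (tid < extra ? 1 : 0);
}
```

Each thread in the outer region would then call omp_set_num_threads(subteam_size(omp_get_max_threads(), num_arrays, omp_get_thread_num())) instead of the uniform division above. For example, 10 threads over 3 arrays gives subteams of 4, 3 and 3 rather than 3, 3 and 3 with one thread idle.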
Upvotes: 3