Saran Tunyasuvunakool

Reputation: 1109

How to split OpenMP threads into subteams over a loop

Suppose I have the following function, which makes use of #pragma omp parallel internally:

void do_heavy_work(double * input_array);

I now want to do_heavy_work on many input_arrays thus:

void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
    for (int i = 0; i < num_arrays; ++i)
    {
        do_heavy_work(input_arrays[i]);
    }
}

Let's say I have N hardware threads. The implementation above would cause num_arrays invocations of do_heavy_work to occur in a serial fashion, each using all N threads internally to do whatever parallel thing it wants.

Now assume that when num_arrays > 1 it is actually more efficient to parallelise over this outer loop than it is to parallelise internally in do_heavy_work.

Ideally I want OpenMP to split its team of OMP_NUM_THREADS threads into num_arrays subteams, so that each do_heavy_work can then thread over the subteam it is given.

What's the easiest way to achieve this?

(For the purpose of this discussion let's assume that num_arrays is not necessarily known beforehand, and also that I cannot change the code in do_heavy_work itself. The code should work on a number of machines so N should be freely specifiable.)

Upvotes: 3

Views: 2431

Answers (1)

Hristo Iliev

Reputation: 74365

OMP_NUM_THREADS can be set to a list, thus specifying the number of threads at each level of nesting. E.g. OMP_NUM_THREADS=10,4 tells the OpenMP runtime to execute the outer parallel region with 10 threads, and each nested region with 4 threads, for a total of up to 40 simultaneously running threads.
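As a quick sanity check of that behaviour, here is a minimal sketch (assuming a compiler with OpenMP support and that nested parallelism is enabled, e.g. via OMP_MAX_ACTIVE_LEVELS=2, or OMP_NESTED=true on older runtimes) that prints the team size at each level:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // Outer level: 10 threads with OMP_NUM_THREADS=10,4
    #pragma omp parallel
    {
        #pragma omp single
        printf("outer team size: %d\n", omp_get_num_threads());

        // Nested level: 4 threads per outer thread with OMP_NUM_THREADS=10,4
        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)
                printf("nested team under outer thread %d: %d threads\n",
                       omp_get_ancestor_thread_num(1), omp_get_num_threads());
        }
    }
    return 0;
}

Compile with -fopenmp (GCC/Clang) and run with e.g. OMP_NUM_THREADS=10,4 OMP_MAX_ACTIVE_LEVELS=2 ./a.out.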

Alternatively, you can make your program adaptive with code similar to this one:

#include <omp.h>

void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
    // One outer thread per array.
    #pragma omp parallel num_threads(num_arrays)
    {
        // Divide the available threads evenly between the outer threads.
        int nested_team_size = omp_get_max_threads() / num_arrays;

        // Only affects nested regions started by the calling thread.
        omp_set_num_threads(nested_team_size);

        #pragma omp for
        for (int i = 0; i < num_arrays; ++i)
        {
            do_heavy_work(input_arrays[i]);
        }
    }
}

This code will not use all available threads if the value of OMP_NUM_THREADS is not divisible by num_arrays. If having a different number of threads per nested region is fine (it could result in some arrays being processed faster than others), decide how to distribute the threads and set nested_team_size in each thread accordingly. Calling omp_set_num_threads() from within a parallel region only affects nested regions started by the calling thread, so you can have different nested team sizes.
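For example, one way to do such a split is sketched below; the even-remainder distribution and the schedule(static,1) clause are illustrative choices rather than part of the code above, and nested parallelism still has to be enabled for the inner regions to spawn threads:

#include <omp.h>

void do_heavy_work(double * input_array);

void do_many_heavy_work(double ** input_arrays, int num_arrays)
{
    // Total number of threads available at the outer level.
    int total_threads = omp_get_max_threads();

    #pragma omp parallel num_threads(num_arrays)
    {
        int tid = omp_get_thread_num();

        // Spread the remainder: the first (total_threads % num_arrays)
        // outer threads get one extra thread in their nested team.
        int nested_team_size = total_threads / num_arrays
                             + (tid < total_threads % num_arrays ? 1 : 0);
        if (nested_team_size < 1)
            nested_team_size = 1;  // more arrays than threads

        // Only affects nested regions started by this thread.
        omp_set_num_threads(nested_team_size);

        // schedule(static,1) maps iteration i to outer thread i here,
        // so each array gets the team size computed by its own thread.
        #pragma omp for schedule(static,1)
        for (int i = 0; i < num_arrays; ++i)
        {
            do_heavy_work(input_arrays[i]);
        }
    }
}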

Upvotes: 3
