JenyaKh

Preventing threads from unnecessarily exiting and keeping the pool alive

Suppose I call a program with OMP_NUM_THREADS=16.

The first function calls #pragma omp parallel for num_threads(16).

The second function calls #pragma omp parallel for num_threads(2).

The third function calls #pragma omp parallel for num_threads(16).

Debugging with gdb shows me that on the second call 14 threads exit, and on the third call 14 new threads are spawned.

Is it possible to prevent 14 threads from exiting on the second call? Thank you.

The proof listings are below.

$ cat a.cpp
#include <omp.h>

void func(int thr) {
    int count = 0;

    #pragma omp parallel for num_threads(thr)
    for(int i = 0; i < 10000000; ++i) {
        count += i;
    }        
}    

int main() {
    func(16);

    func(2);

    func(16);

    return 0;
} 
$ g++ -o a a.cpp -fopenmp -g
$ ldd a
...
libgomp.so.1 => ... gcc-9.3.0/lib64/libgomp.so.1 
...
$ OMP_NUM_THREADS=16 gdb a

...

Breakpoint 1, main () at a.cpp:13
13          func(16);
(gdb) n
[New Thread 0xffffbe24f160 (LWP 27216)]
[New Thread 0xffffbda3f160 (LWP 27217)]
[New Thread 0xffffbd22f160 (LWP 27218)]
[New Thread 0xffffbca1f160 (LWP 27219)]
[New Thread 0xffffbc20f160 (LWP 27220)]
[New Thread 0xffffbb9ff160 (LWP 27221)]
[New Thread 0xffffbb1ef160 (LWP 27222)]
[New Thread 0xffffba9df160 (LWP 27223)]
[New Thread 0xffffba1cf160 (LWP 27224)]
[New Thread 0xffffb99bf160 (LWP 27225)]
[New Thread 0xffffb91af160 (LWP 27226)]
[New Thread 0xffffb899f160 (LWP 27227)]
[New Thread 0xffffb818f160 (LWP 27228)]
[New Thread 0xffffb797f160 (LWP 27229)]
[New Thread 0xffffb716f160 (LWP 27230)]
15          func(2);
(gdb) 
[Thread 0xffffba9df160 (LWP 27223) exited]
[Thread 0xffffb716f160 (LWP 27230) exited]
[Thread 0xffffbca1f160 (LWP 27219) exited]
[Thread 0xffffb797f160 (LWP 27229) exited]
[Thread 0xffffb818f160 (LWP 27228) exited]
[Thread 0xffffbd22f160 (LWP 27218) exited]
[Thread 0xffffb899f160 (LWP 27227) exited]
[Thread 0xffffbda3f160 (LWP 27217) exited]
[Thread 0xffffbb1ef160 (LWP 27222) exited]
[Thread 0xffffb91af160 (LWP 27226) exited]
[Thread 0xffffba1cf160 (LWP 27224) exited]
[Thread 0xffffb99bf160 (LWP 27225) exited]
[Thread 0xffffbb9ff160 (LWP 27221) exited]
[Thread 0xffffbc20f160 (LWP 27220) exited]
17          func(16);
(gdb) 
[New Thread 0xffffbb9ff160 (LWP 27231)]
[New Thread 0xffffbc20f160 (LWP 27232)]
[New Thread 0xffffb99bf160 (LWP 27233)]
[New Thread 0xffffba1cf160 (LWP 27234)]
[New Thread 0xffffbda3f160 (LWP 27235)]
[New Thread 0xffffbd22f160 (LWP 27236)]
[New Thread 0xffffbca1f160 (LWP 27237)]
[New Thread 0xffffbb1ef160 (LWP 27238)]
[New Thread 0xffffba9df160 (LWP 27239)]
[New Thread 0xffffb91af160 (LWP 27240)]
[New Thread 0xffffb899f160 (LWP 27241)]
[New Thread 0xffffb818f160 (LWP 27242)]
[New Thread 0xffffb797f160 (LWP 27243)]
[New Thread 0xffffb716f160 (LWP 27244)]
19          return 0;

Upvotes: 3

Views: 617

Answers (2)

dreamcrash

One approach would be to create a single parallel region, select which threads will execute the loop, and manually distribute the loop iterations per thread. For simplicity's sake, I will assume a distribution equivalent to schedule(static, 1):

#include <omp.h>

void func(int total_threads) {
    int count = 0;
    int thread_id = omp_get_thread_num();
    if (thread_id < total_threads)
    {
       for(int i = thread_id; i < 10000000; i += total_threads) {
           count += i;
       }
    }
    #pragma omp barrier
}

int main() {
    ...
    #pragma omp parallel num_threads(max_threads_to_be_used)
    {
        func(16);
        func(2);
        func(16);
    }
    return 0;
} 

Bear in mind that there is a race condition on count += i; that would have to be fixed. In the original code, you could easily fix it with the reduction clause, namely #pragma omp parallel for num_threads(thr) reduction(+:count); a sketch of that version is shown after the next listing. In the code with the manual loop, you could solve it as follows:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int func(int total_threads) {
    int count = 0;
    int thread_id = omp_get_thread_num();
    if (thread_id < total_threads)
    {
       for(int i = thread_id; i < 10000000; i += total_threads) 
           count += i;
    }
    return count;        
}    

int main() {
    int max_threads_to_be_used = // the max that you want;
    int* count_array = (int*) malloc(max_threads_to_be_used * sizeof(int));
    #pragma omp parallel num_threads(max_threads_to_be_used)
    {
        int count = func(16);
        count += func(2);
        count += func(16);
        count_array[omp_get_thread_num()] = count;
    }
    int count = 0;
    for(int i = 0; i < max_threads_to_be_used; i++) 
        count += count_array[i];
    printf("Count = %d\n", count);
    return 0;
} 
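
For reference, a minimal sketch of the question's original function with the reduction fix applied to the original parallel for. I use long long for count here to avoid overflowing the full sum; that type change is my addition, not part of the original code:

#include <omp.h>
#include <stdio.h>

// Sketch: the original per-call parallel for, made race-free with a
// reduction. Each thread accumulates a private copy of count, and the
// copies are combined with + when the loop finishes.
long long func(int thr) {
    long long count = 0;   // long long to hold the full sum without overflow

    #pragma omp parallel for num_threads(thr) reduction(+:count)
    for (int i = 0; i < 10000000; ++i) {
        count += i;
    }
    return count;
}

int main() {
    printf("%lld\n", func(16));   // expected 49999995000000
    return 0;
}

Note that this version still opens a new parallel region per call, so it only addresses the race, not the thread re-spawning discussed in the question.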

I would say that, most of the time, one uses the same number of threads in each parallel region, so this kind of pattern should not be a common issue.

Upvotes: 1

Hristo Iliev

The simple answer is that it isn't possible with GCC to force the runtime to keep the threads around. From a cursory reading of the libgomp source code, there are no ICVs, portable or vendor-specific, that prevent the termination of excess idle threads between consecutive regions (someone correct me if I'm wrong).

If you really need to rely on the unportable requirement that the OpenMP runtime uses persistent threads across regions with varying team sizes in between, then use Clang or Intel C++ instead of GCC. Clang's (actually LLVM's) OpenMP runtime is based on the open-source version of Intel's and they both behave the way you want. Again, this is not portable and the behaviour may change in future versions. It is instead advisable to not write your code in such a way that its performance depends on the particularities of the OpenMP implementation. For example, if the loop takes several orders of magnitude more time than the creation of a thread team (which is on the order of tens of microseconds on modern systems), it won't really matter whether the runtime uses persistent threads or not.
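
To put numbers on that comparison, one rough way to check is to time the creation of a thread team against the loop itself with omp_get_wtime(). A minimal sketch: the 16-thread team and the loop body are taken from the question, the timing scaffolding is my addition and the measured values will vary by system:

#include <omp.h>
#include <stdio.h>

int main() {
    // Time the first 16-thread parallel region, which includes spawning
    // the worker threads, against the actual loop. If the loop time
    // dwarfs the region-creation time, thread re-spawning is irrelevant.
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(16)
    {
        // intentionally (nearly) empty body
    }
    double t1 = omp_get_wtime();

    long long count = 0;
    #pragma omp parallel for num_threads(16) reduction(+:count)
    for (int i = 0; i < 10000000; ++i)
        count += i;
    double t2 = omp_get_wtime();

    printf("region creation: %g s, loop: %g s (count=%lld)\n",
           t1 - t0, t2 - t1, count);
    return 0;
}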

If OpenMP overhead is really a problem, e.g., if the work done in the loop is not enough to amortise the overhead, a portable solution is to lift the parallel region and then either re-implement the for worksharing construct like in @dreamcrash's answer or (ab)use OpenMP's loop scheduling by setting a chunk size that will only result in the desired number of threads working on the problem:

#include <omp.h>

void func(int thr) {
    static int count;
    const int N = 10000000;

    int rem = N % thr;
    int chunk_size = N / thr;

    #pragma omp single
    count = 0;

    #pragma omp for schedule(static,chunk_size) reduction(+:count)
    for(int i = 0; i < N-rem; ++i) {
        count += i;
    }

    if (rem > 0) {
        #pragma omp for schedule(static,1) reduction(+:count)
        for(int i = N-rem; i < N; ++i) {
            count += i;
        }
    }

    #pragma omp barrier
}

int main() {
    int nthreads = max of {16, 2, other values of thr};

    #pragma omp parallel num_threads(nthreads)
    {
        func(16);

        func(2);

        func(16);
    }

    return 0;
}

You need chunks of exactly equal sizes in all threads. The second loop is there to take care of the case when thr does not divide the number of iterations. Also, one cannot simply sum across private variables, hence count has to be shared, e.g., by making it static. This is ugly and drags along a bunch of synchronisation necessities that may have overhead comparable with spawning new threads and make the entire exercise pointless.
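
As an illustration of the kind of synchronisation involved, here is a minimal sketch (my own variation, not part of the code above) that keeps a plain shared accumulator and merges per-thread partial sums with an atomic update instead of making count static:

#include <omp.h>
#include <stdio.h>

int main() {
    const int N = 10000000;
    long long count = 0;   // shared accumulator

    #pragma omp parallel num_threads(16)
    {
        // Each thread sums its own slice into a private variable...
        long long local = 0;
        #pragma omp for schedule(static)
        for (int i = 0; i < N; ++i)
            local += i;

        // ...and the private partial sums are merged under an atomic
        // update, which is exactly the kind of synchronisation overhead
        // the paragraph above warns about.
        #pragma omp atomic
        count += local;
    }

    printf("count = %lld\n", count);
    return 0;
}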

Upvotes: 2
