BodneyC

Reputation: 110

OpenMP - Overhead when Spawning and Terminating Threads in for-loop

I'm fairly new to OpenMP and I have some Monte Carlo code I am trying to parallelise.

I have a for-loop which must run serially and which calls the new_value() function on each iteration:

for(int i = 0; i < MAX_VAL; i++)
    new_value();

This function opens a parallel region on each call:

void new_value()
{
#pragma omp parallel default(shared)
{
    int thread_rank = omp_get_thread_num();

#pragma omp for schedule(static)
    for(int i = 0; i < N; i++)
        arr[i] = update(thread_rank);
}
}

This works, but there is a significant amount of overhead associated with spawning and terminating the threads on every call. I was wondering if anyone knew a way to spawn the threads (and obtain thread_rank) once, before entering the outer loop, without parallelising that loop itself.

There are several questions asking the same thing, but they are either answered incorrectly or not answered at all. Examples include:

This question, which asks a similar thing; the answer suggests creating a parallel region and then using #pragma omp single on the outer-most loop, but as 'Joe C' pointed out in the answer's comments, this does not work. I can confirm that the program just hangs.

This question asks exactly the same thing, but the (unaccepted) answer is just to parallelise the outer-most loop, running it 4000 * num_threads times, which is neither what the asker wanted nor what I want.

Upvotes: 1

Views: 981

Answers (1)

Zulan

Reputation: 22660

The approach from the answer to your second linked question is actually correct.

#pragma omp parallel
for(int i = 0; i < MAX_VAL; i++)
    new_value();

void new_value()
{
    int thread_rank = omp_get_thread_num();

#pragma omp for schedule(static)
    for(int i = 0; i < N; i++)
        arr[i] = update(thread_rank);
}

This is correct and exactly what you want. It has the same semantics as the code in your question. The differences are that there is now only one parallel region, and that the outer loop variable i is computed redundantly by every thread in the team. Note that the outer loop is not parallelised in a worksharing manner (this is not omp parallel for), so every thread executes all of its iterations.

So when this code is run, num_threads threads will execute the loop header once, call new_value, and all reach the omp for with their private i == 0. They will share the work of the inner loop, then wait at the implicit barrier at the end of that loop until everyone has completed it, increment their private i, and repeat. I hope it is clear now that this has the same behaviour with respect to the inner loop as before, with less thread-management overhead.

Upvotes: 0
