el.mojito

Reputation: 170

Forking once and then using the threads in a different procedure?

If I fork inside my main program and then call a subroutine inside a single directive, what is the behavior if I enter an OMP parallel directive in this subroutine?

My guess/hope is that the existing threads are reused, as they should all be idle at that moment.

Pseudo-Example:

double A[100];

void evolve();
void do_stuff(int i);

int main() {
   #pragma omp parallel num_threads(2)
   {
      #pragma omp single
      {
         for (int t = 0; t < 1000; t++) {
            evolve();
         }
      }
   }
}

void evolve() {
   #pragma omp parallel for num_threads(2)
   for (int i = 0; i < 100; i++) {
      do_stuff(i);
   }
}

void do_stuff(int i) {
   // expensive calculation on array element A[i]
}

As evolve() is called very often, forking here would cause way too much overhead, so I would like to fork only once, call evolve() from a single thread, and split the work of the do_stuff() calls across the existing threads.

For Fortran this seems to work: I get roughly an 80-90% speedup on a simple example using 2 threads. But in C++ I see different behavior: only the thread that executes the single directive is used for the loop in evolve().

I worked around the problem by using the task directive in the main program and passing the loop limits to evolve(), but this feels like a clumsy solution...
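Roughly, the workaround looks like this (a simplified sketch; the two-argument evolve() and the fixed halving of the range are illustrative, not my exact code):

void do_stuff(int i);

void evolve(int lo, int hi) {
   for (int i = lo; i < hi; i++) {
      do_stuff(i);
   }
}

int main() {
   #pragma omp parallel num_threads(2)
   #pragma omp single
   for (int t = 0; t < 1000; t++) {
      #pragma omp task
      evolve(0, 50);       // first half of the range
      #pragma omp task
      evolve(50, 100);     // second half
      #pragma omp taskwait // keep the time steps ordered
   }
}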

Why does the behavior differ between Fortran and C++, and what would be the solution in C++?

Upvotes: 1

Views: 273

Answers (3)

Michael Klemm

Reputation: 2873

Unfortunately, your code will likely not work as expected in all cases. If you have a code structure like this:

void foo() {
   #pragma omp parallel
   #pragma omp single
   bar();
}

void bar() {
   #pragma omp parallel
   printf("...");
}

OpenMP is requested to create a new team of threads when entering the parallel region in bar(). OpenMP calls this "nested parallelism". What exactly happens, however, depends on the actual OpenMP implementation you use and on the setting of OMP_NESTED.

OpenMP implementations are not required to support nested parallelism. It would be perfectly legal for an implementation to ignore the parallel region in bar() and simply execute it with one thread. OMP_NESTED can be used to turn nesting on and off, if the implementation supports it.

In your case, things happened to go well, since you sent all threads but one to sleep. That one thread then created a new team of threads at full size (potentially NEW threads, not reusing the old ones). If you omitted the single construct, you could easily end up with thousands of threads.

Unfortunately, OpenMP does not support your pattern of creating a parallel team, having one thread walk the call stack, and then distributing work across the other team members through a worksharing construct like for. If you need this code pattern, the only solution is OpenMP tasks.
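For instance, a task-based version of the original example could look like the following (a minimal sketch, assuming a compiler that implements the taskloop construct from OpenMP 4.5; the bounds mirror the pseudo-example from the question):

void do_stuff(int i);

void evolve() {
   // taskloop chops the iterations into tasks that the idle team
   // members pick up; the implicit taskgroup at its end makes each
   // call to evolve() wait for its own tasks to complete
   #pragma omp taskloop
   for (int i = 0; i < 100; i++) {
      do_stuff(i);
   }
}

int main() {
   #pragma omp parallel num_threads(2)
   #pragma omp single
   for (int t = 0; t < 1000; t++) {
      evolve();
   }
}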

Cheers, -michael

Upvotes: 1

pburka

Reputation: 1474

Your example doesn't actually call fork(), so I suspect you don't mean fork in the system-call sense (i.e. duplicating your process). If that really is what you meant, however, most OpenMP implementations will probably not work correctly in a forked process, because threads are typically not preserved across fork() calls. If the OpenMP implementation you use registers pthread_atfork() handlers, it may work correctly after a fork() call, but it will not use the same threads as the parent process.

Upvotes: 0

Massimiliano

Reputation: 8042

I believe orphaned directives are the cleanest solution in your case:

double A[100];

void evolve();
void do_stuff(int i);

int main() {
   #pragma omp parallel num_threads(2)
   {
      // Each thread calls evolve() a thousand times
      for (int t = 0; t < 1000; t++) {
         evolve();
      }
   }
}

void evolve() {
   // The orphaned construct inside evolve() binds
   // to the innermost enclosing parallel region
   #pragma omp for
   for (int i = 0; i < 100; i++) {
      do_stuff(i);
   } // Implicit thread synchronization
}

void do_stuff(int i) {
   // expensive calculation on array element A[i]
}

This will work because (section 2.6.1 of the standard):

A loop region binds to the innermost enclosing parallel region

That said, your original code uses nested parallel constructs. To be sure they are enabled, you must set the environment variable OMP_NESTED to true, because otherwise (quoting Appendix E of the latest standard):

OMP_NESTED environment variable: if the value is neither true nor false the behavior is implementation defined
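For completeness, nesting can also be requested from within the program itself (a minimal sketch; this applies to your original nested version, not to the orphaned one above):

#include <omp.h>

int main() {
   omp_set_nested(1); // same effect as OMP_NESTED=true
   // ... original code with nested parallel regions ...
}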

Upvotes: 1
