arc_lupus

Reputation: 4114

MPI-size and number of OpenMP-Threads

I am trying to write a hybrid OpenMP/MPI program and therefore want to understand how the number of OpenMP threads relates to the number of MPI processes. To that end, I created a small test program:

#include <iostream>
#include <mpi.h>
#include <thread>
#include <sstream>
#include <omp.h>

int main(int argc, char *argv[]) {
    int rank, nprocs, thread_id, nthreads, cxx_procs;
    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel private(thread_id, nthreads, cxx_procs)
    {
        thread_id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        cxx_procs = std::thread::hardware_concurrency();
        // build the whole line first so output from concurrent threads
        // does not interleave
        std::stringstream omp_stream;
        omp_stream << "I'm thread " << thread_id
                   << " out of " << nthreads
                   << " on MPI process nr. " << rank
                   << " out of " << nprocs
                   << ", while hardware_concurrency reports " << cxx_procs
                   << " processors\n";
        std::cout << omp_stream.str();
    }

    MPI_Finalize();
    return 0;
}

which is compiled using

mpicxx -fopenmp -std=c++17 -o omp_mpi source/main.cpp -lgomp

with gcc 9.3.1 and Open MPI 3. Now, when executing it on an i7-6700 (4 cores / 8 threads) with ./omp_mpi, I get the following output:

I'm thread 1 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors

i.e., as expected.
When executing it using mpirun -n 1 ./omp_mpi, I would expect the same, but instead I get

I'm thread 0 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors

Where are the other threads? When executing it on two MPI processes instead, I get

I'm thread 0 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 0 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors

i.e., still only two OpenMP threads per process. But when executing it on four MPI processes, I get

I'm thread 1 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors

Now I suddenly get eight OpenMP threads per MPI process. Where does that change come from?

Upvotes: 4

Views: 3354

Answers (2)

Hristo Iliev

Reputation: 74375

You are observing an interaction between a peculiarity of Open MPI and a peculiarity of the GNU OpenMP runtime, libgomp.

First, the number of threads in OpenMP is controlled by the nthreads-var ICV (internal control variable), which is set either by calling omp_set_num_threads() or by setting OMP_NUM_THREADS in the environment. When OMP_NUM_THREADS is not set and omp_set_num_threads() is not called, the runtime is free to choose whatever default it deems reasonable. In the case of libgomp, the manual says:

OMP_NUM_THREADS

Specifies the default number of threads to use in parallel regions. The value of this variable shall be a comma-separated list of positive integers; the value specifies the number of threads to use for the corresponding nested level. Specifying more than one item in the list will automatically enable nesting by default. If undefined one thread per CPU is used.
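For illustration, either mechanism overrides this default. A minimal sketch, with an arbitrary thread count of 4:

$ OMP_NUM_THREADS=4 ./omp_mpi

or, equivalently, in the program itself before the parallel region:

omp_set_num_threads(4);  // sets the nthreads-var ICV for subsequent parallel regions
#pragma omp parallel
{
    // omp_get_num_threads() now returns 4, regardless of the affinity mask
}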

What the manual fails to mention is that libgomp uses various heuristics to determine the right number of CPUs. On Linux and Windows, it derives that number from the process affinity mask. If the process is bound to a single logical CPU, you only get one thread:

$ taskset -c 0 ./omp_mpi
I'm thread 0 out of 1 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors

If you bind it to several logical CPUs, their count is used:

$ taskset -c 0,2,5 ./omp_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
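To see the same number that libgomp sees, you can query the affinity mask yourself. A minimal Linux-only sketch (sched_getaffinity() and the CPU_* macros are glibc APIs):

#include <iostream>
#include <sched.h>  // sched_getaffinity(), cpu_set_t, CPU_COUNT() -- glibc, Linux

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        // this count is what libgomp bases its default thread number on
        std::cout << "logical CPUs in affinity mask: " << CPU_COUNT(&mask) << "\n";
    }
    return 0;
}

Run under taskset -c 0,2,5, it should print 3, while std::thread::hardware_concurrency() still reports the whole machine.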

This behaviour specific to libgomp interacts with another behaviour specific to Open MPI. Back in 2013, Open MPI changed its default binding policy. The reasons were a mix of technical considerations and politics, and you can read more on Jeff Squyres' blog (Jeff is a core Open MPI developer).

The moral of the story is:

Always set the number of OpenMP threads and the MPI binding policy explicitly. With Open MPI, the way to set environment variables is with -x:

$ mpiexec -n 2 --map-by node:PE=3 --bind-to core -x OMP_NUM_THREADS=3 ./omp_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors

Note that I have hyperthreading enabled and so --bind-to core and --bind-to hwthread produce different results without explicitly setting OMP_NUM_THREADS:

$ mpiexec -n 2 --map-by node:PE=3 --bind-to core ./omp_mpi
I'm thread 0 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors

vs

$ mpiexec -n 2 --map-by node:PE=3 --bind-to hwthread ./omp_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors

--map-by node:PE=3 gives each MPI rank three processing elements (PEs) on its node. When binding to cores, a PE is a core. When binding to hardware threads, a PE is a hardware thread, so to cover the same three cores one should use --map-by node:PE=#cores*#threads, i.e., --map-by node:PE=6 in my case.

Whether the OpenMP runtime respects the affinity mask set by MPI and whether it maps its own thread affinity onto it, and what to do if not, is a completely different story.
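For completeness, the standard OpenMP knobs for that part of the story are OMP_PLACES and OMP_PROC_BIND; a hedged example, forwarding them through mpiexec with -x:

$ mpiexec -n 2 --map-by node:PE=3 --bind-to core -x OMP_NUM_THREADS=3 -x OMP_PROC_BIND=close -x OMP_PLACES=cores ./omp_mpi

This asks the OpenMP runtime to pin its threads to consecutive cores within whatever mask Open MPI hands each rank, but whether a given runtime honours that is exactly the different story mentioned above.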

Upvotes: 4

zkoza

Reputation: 2860

The man page for mpirun explains:

If you are simply looking for how to run an MPI application, you probably want to use a command line of the following form:

  % mpirun [ -np X ] [ --hostfile <filename> ]  <program>

This will run X copies of <program> in your current run-time environment (...)

Please note that mpirun automatically binds processes as of the start of the v1.8 series. Three binding patterns are used in the absence of any further directives:

  Bind to core:     when the number of processes is <= 2
  Bind to socket:   when the number of processes is > 2
  Bind to none:     when oversubscribed

If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or specific number of processing elements per application process.

Now, if you specify 1 or 2 MPI processes, mpirun defaults to --bind-to core, which on a hyperthreaded CPU results in 2 hardware threads per process and hence 2 OpenMP threads. If, however, you specify 4 MPI processes, mpirun defaults to --bind-to socket, and you get 8 threads per process, as your machine is a single-socket one. I tested this on a laptop (1 socket, 2 cores, 4 threads) and a workstation (2 sockets, 12 cores per socket, 2 threads per core), and the program (with no -np argument) behaves as described above: on the workstation there are 24 MPI processes with 24 OpenMP threads each.
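To check which default binding actually applied in your case, Open MPI can print each rank's binding at launch via --report-bindings, e.g.:

$ mpirun -n 4 --report-bindings ./omp_mpi

Each rank then reports the mask it was bound to (core, socket, or none), which, combined with the libgomp behaviour described in the other answer, accounts for the observed thread counts.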

Upvotes: 3
