Reputation: 4114
I am trying to write a hybrid OpenMP/MPI program and am therefore trying to understand the relationship between the number of OpenMP threads and MPI processes. To that end, I created a small test program:
#include <iostream>
#include <mpi.h>
#include <thread>
#include <sstream>
#include <omp.h>
int main(int argc, char *argv[]) {
    int rank, nprocs, thread_id, nthreads, cxx_procs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel private(thread_id, nthreads, cxx_procs)
    {
        thread_id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        cxx_procs = std::thread::hardware_concurrency();

        std::stringstream omp_stream;
        omp_stream << "I'm thread " << thread_id
                   << " out of " << nthreads
                   << " on MPI process nr. " << rank
                   << " out of " << nprocs
                   << ", while hardware_concurrency reports " << cxx_procs
                   << " processors\n";
        std::cout << omp_stream.str();
    }

    MPI_Finalize();
    return 0;
}
which is compiled using
mpicxx -fopenmp -std=c++17 -o omp_mpi source/main.cpp -lgomp
with gcc 9.3.1 and Open MPI 3.
Now, when executing it on an i7-6700 (4c/8t) with ./omp_mpi, I get the following output:
I'm thread 1 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
i.e. as expected.
When executing it using mpirun -n 1 omp_mpi, I would expect the same, but instead I get:
I'm thread 0 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
Where are the other threads? When executing it on two MPI processes instead, I get
I'm thread 0 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 0 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
i.e. still only two OpenMP threads. But when executing it on four MPI processes, I get
I'm thread 1 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
Now I suddenly get eight OpenMP threads per MPI process. Where does that change come from?
Upvotes: 4
Views: 3354
Reputation: 74375
You are observing an interaction between a peculiarity of Open MPI and the GNU OpenMP runtime, libgomp.
First, the number of threads in OpenMP is controlled by the num-threads ICV (internal control variable), and the way to set it is either to call omp_set_num_threads() or to set OMP_NUM_THREADS in the environment. When OMP_NUM_THREADS is not set and omp_set_num_threads() is not called, the runtime is free to choose whatever it deems a reasonable default. In the case of libgomp, the manual says:
OMP_NUM_THREADS
Specifies the default number of threads to use in parallel regions. The value of this variable shall be a comma-separated list of positive integers; the value specifies the number of threads to use for the corresponding nested level. Specifying more than one item in the list will automatically enable nesting by default. If undefined one thread per CPU is used.
What it fails to mention is that it uses various heuristics to determine the right number of CPUs. On Linux and Windows, the process affinity mask is used for that (if you like to read code, the one for Linux is right here). If the process is bound to a single logical CPU, you only get one thread:
$ taskset -c 0 ./omp_mpi
I'm thread 0 out of 1 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
If you bind it to several logical CPUs, their count is used:
$ taskset -c 0,2,5 ./omp_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
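To see what libgomp bases its decision on, you can inspect the process affinity mask yourself. The following is a minimal, Linux-only sketch (it uses sched_getaffinity(); the actual heuristic inside libgomp is more involved) that compares the size of the affinity mask with what OpenMP and the C++ runtime report:

#include <iostream>
#include <thread>
#include <omp.h>
#include <sched.h>   // sched_getaffinity, CPU_COUNT (glibc)

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // Query the affinity mask of the calling process (pid 0 means "self")
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        std::cerr << "sched_getaffinity failed\n";
        return 1;
    }
    std::cout << "CPUs in affinity mask:  " << CPU_COUNT(&mask) << '\n'
              << "omp_get_max_threads(): " << omp_get_max_threads() << '\n'
              << "hardware_concurrency(): " << std::thread::hardware_concurrency() << '\n';
    return 0;
}

Run it under taskset (or under mpirun with different binding options) and the first two numbers should follow the mask, while hardware_concurrency() keeps reporting all logical CPUs.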
This libgomp-specific behaviour interacts with another behaviour specific to Open MPI. Back in 2013, Open MPI changed its default binding policy. The reasons were a mix of technical considerations and politics, and you can read more about it on Jeff Squyres' blog (Jeff is a core Open MPI developer).
The moral of the story is: always set the number of OpenMP threads and the MPI binding policy explicitly. With Open MPI, the way to set environment variables for the launched processes is with -x:
$ mpiexec -n 2 --map-by node:PE=3 --bind-to core -x OMP_NUM_THREADS=3 ./omp_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
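Alternatively, or in addition, the thread count can be fixed programmatically with omp_set_num_threads(), which makes the program independent of whatever default the launcher's binding produces. A minimal sketch (the value 3 is just an example; note that this only controls how many threads are created, not where they run, so the binding policy still matters):

#include <cstdio>
#include <omp.h>

int main() {
    // Overrides both the libgomp default and OMP_NUM_THREADS;
    // a num_threads clause on the directive would still take precedence.
    omp_set_num_threads(3);

    #pragma omp parallel
    {
        std::printf("thread %d out of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}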
Note that I have hyperthreading enabled, so --bind-to core and --bind-to hwthread produce different results without explicitly setting OMP_NUM_THREADS:
$ mpiexec -n 2 --map-by node:PE=3 --bind-to core ./omp_mpi
I'm thread 0 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
vs
$ mpiexec -n 2 --map-by node:PE=3 --bind-to hwthread ./omp_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
--map-by node:PE=3 gives each MPI rank three processing elements (PEs) per node. When binding to core, a PE is a core. When binding to hardware threads, a PE is a hardware thread, and one should use --map-by node:PE=#cores*#threads, i.e., --map-by node:PE=6 in my case.
Whether the OpenMP runtime respects the affinity mask set by MPI and whether it maps its own thread affinity onto it, and what to do if not, is a completely different story.
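One way to check that for a given combination of binding options and OpenMP settings is to let every thread report the logical CPU it is currently running on. A small diagnostic sketch, again Linux-specific (sched_getcpu()); without OMP_PROC_BIND/OMP_PLACES the threads may migrate between calls, so treat the output as a snapshot:

#include <cstdio>
#include <mpi.h>
#include <omp.h>
#include <sched.h>   // sched_getcpu (glibc)

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        // Logical CPU this thread happens to be running on right now
        std::printf("rank %d, OpenMP thread %d of %d: on CPU %d\n",
                    rank, omp_get_thread_num(), omp_get_num_threads(),
                    sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}

Launch it under mpirun with different --bind-to settings (and optionally OMP_PROC_BIND=true) to see whether the threads stay within the CPUs assigned to their rank.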
Upvotes: 4
Reputation: 2860
The man page for mpirun explains:
If you are simply looking for how to run an MPI application, you probably want to use a command line of the following form:
% mpirun [ -np X ] [ --hostfile <filename> ] <program>
This will run X copies of <program> in your current run-time environment (...)
Please note that mpirun automatically binds processes as of the start of the v1.8 series. Three binding patterns are used in the absence of any further directives:
Bind to core: when the number of processes is <= 2
Bind to socket: when the number of processes is > 2
Bind to none: when oversubscribed
If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or specific number of processing elements per application process.
Now, if you specify 1 or 2 MPI processes, mpirun defaults to --bind-to core, which results in 2 threads per MPI process.
If, however, you specify 4 MPI processes, mpirun defaults to --bind-to socket and you get 8 threads per process, as your machine is a single-socket one.
I tested it on a laptop (1 socket / 2 cores / 4 threads) and a workstation (2 sockets, 12 cores per socket, 2 threads per core), and the program (launched without an -np argument) behaves as described above: on the workstation there are 24 MPI processes with 24 OpenMP threads each.
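As a quick sanity check, you can also ask mpirun to print the binding it applies to each rank (the exact format of the report differs between Open MPI versions), e.g.
mpirun --report-bindings -np 4 ./omp_mpi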
Upvotes: 3