Reputation: 58
I am trying to start a sequence of MPI_Igather calls (non-blocking collectives, available since MPI 3), do some work, and then, whenever an Igather finishes, do some more work on that data.
That works fine, unless I start the Igathers from different threads on each MPI rank. In that case I often get a deadlock, even though I call MPI_Init_thread to make sure that MPI_THREAD_MULTIPLE is provided. Non-blocking collectives do not have a tag to match sends and receives, but I thought this was handled by the MPI_Request object associated with each collective operation?
The simplest failing example I could come up with is shown below.
I made two variants of this program: igather_threaded.cpp (the code below), which behaves as described above, and igather_threaded_v2.cpp, which gathers everything on MPI rank 0. That version does not deadlock, but the data is not ordered correctly either.
igather_threaded.cpp:
#include <iostream>
#include <memory>
#include <mpi.h>
#define PRINT_ARRAY(STR,PTR) \
  for (int r=0; r<nproc; r++){ \
    if (rank==r){ \
      for (int j=0; j<nproc; j++){ \
        (STR) << (PTR)[j] << " "; \
      } \
      (STR) << std::endl << std::flush; \
    } \
    MPI_Barrier(MPI_COMM_WORLD); \
  }
int main(int argc, char* argv[]) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided != MPI_THREAD_MULTIPLE){
    std::cerr << "Your MPI installation does not provide threading level MPI_THREAD_MULTIPLE, required for this program." << std::endl;
    MPI_Abort(MPI_COMM_WORLD, -1);
  }

  int rank, nproc;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  std::unique_ptr<double[]> M1(new double[nproc]);
  std::unique_ptr<double[]> M2(new double[nproc]);
  for (int j=0; j<nproc; j++) M1[j] = rank*nproc + j + 1;
  PRINT_ARRAY(std::cout, M1);

  std::unique_ptr<MPI_Request[]> requests(new MPI_Request[nproc]);

  // start one Igather per root, each from a different OpenMP thread
  #pragma omp parallel for schedule(static) shared(requests)
  for (int j=0; j<nproc; j++){
    MPI_Igather(
        M1.get()+j, 1, MPI_DOUBLE,
        M2.get(),   1, MPI_DOUBLE, j,
        MPI_COMM_WORLD, &requests[j]
    );
  }
  MPI_Waitall(nproc, requests.get(), MPI_STATUSES_IGNORE);

  if (rank==0) std::cout << " => " << std::endl;
  PRINT_ARRAY(std::cout, M2);

  MPI_Finalize();
  return 0;
}
My question is: is my code theoretically correct, and is this a bug in OpenMPI (4.0.3 used here), or did I miss something that disallows starting MPI_Igather calls from multiple threads? I also tried it with Intel MPI, with a similar result.
To compile, I use OpenMPI 4.0.3 (configured to support MPI_THREAD_MULTIPLE), gcc 11.2.0, and the command line
> mpicxx -std=c++17 -fopenmp -o igather_threaded igather_threaded.cpp
To run, I use
mpirun -np 4 --bind-to none env OMP_NUM_THREADS=4 env OMP_PROC_BIND=false ./igather_threaded
You may have to increase the number of MPI ranks and/or OpenMP threads, or run it several times, because the exact behavior is not deterministic.
Upvotes: 4
Views: 194
Reputation: 5794
In your scheme the collectives are started from different threads, so they can be issued in any order. Having distinct requests is not enough to disambiguate them: MPI insists that collectives on a given communicator are started in the same order on each process.
You could fix this by:
#pragma omp parallel for schedule(static) shared(requests) ordered
for (int j=0; j<nproc; j++){
  #pragma omp ordered
  MPI_Igather(
      M1.get()+j, 1, MPI_DOUBLE,
      M2.get(),   1, MPI_DOUBLE, j,
      MPI_COMM_WORLD, &requests[j]
  );
}
but that serializes the calls and takes away most of the speedup you were hoping to get from threading.
Upvotes: 0
Reputation: 56
The MPI standard states in §6.12 that
All processes must call collective operations (blocking and nonblocking) in the same order per communicator.
Two Igather operations with different root arguments are different collective operations. Your threads issue them on the same communicator without any synchronization, so the resulting order may not be the same on all processes.
One solution would be to use a separate communicator for each thread (assuming all processes use the same number of threads). That way each communicator carries only a single collective operation, and you can be sure that the right operations match up between processes. If you repeat that code multiple times, just create the communicators up front and reuse them.
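A minimal sketch of that approach, assuming the same variables as in your question (M1, M2, requests, nproc); the comms array and its setup are illustrative additions, not part of your original code:
// Illustrative sketch: one duplicated communicator per gather.
// MPI_Comm_dup is itself a collective, so it is called from a single
// thread, in the same order on all processes.
std::unique_ptr<MPI_Comm[]> comms(new MPI_Comm[nproc]);
for (int j=0; j<nproc; j++){
  MPI_Comm_dup(MPI_COMM_WORLD, &comms[j]);
}

#pragma omp parallel for schedule(static) shared(requests, comms)
for (int j=0; j<nproc; j++){
  MPI_Igather(
      M1.get()+j, 1, MPI_DOUBLE,
      M2.get(),   1, MPI_DOUBLE, j,
      comms[j], &requests[j]   // each operation is alone on its communicator
  );
}
MPI_Waitall(nproc, requests.get(), MPI_STATUSES_IGNORE);

for (int j=0; j<nproc; j++) MPI_Comm_free(&comms[j]);
Duplicating communicators has some cost, which is why creating them once and reusing them pays off when the gathers are repeated.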
Upvotes: 3