Reputation: 729
The MPI 3.0 standard says in Section 5.13 that
Finally, in multithreaded implementations, one can have more than one, concurrently executing, collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.
I wrote the following program, which compiles but does NOT execute correctly and dumps core:
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int required = MPI_THREAD_MULTIPLE, provided, rank, size, threadID, threadProcRank;
    MPI_Comm comm = MPI_COMM_WORLD;

    MPI_Init_thread(&argc, &argv, required, &provided);
    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);

    int buffer1[10000] = {0};
    int buffer2[10000] = {0};

    #pragma omp parallel private(threadID, threadProcRank) shared(comm, buffer1)
    {
        threadID = omp_get_thread_num();
        MPI_Comm_rank(comm, &threadProcRank);
        printf("\nMy thread ID is %d and I am in process ranked %d", threadID, threadProcRank);

        if (threadID == 0)
            MPI_Bcast(buffer1, 10000, MPI_INTEGER, 0, comm);
        if (threadID == 1)
            MPI_Bcast(buffer1, 10000, MPI_INTEGER, 0, comm);
    }

    MPI_Finalize();
    return 0;
}
My question is: the two threads in each process (thread ID 0 and thread ID 1) each post a broadcast call, which in the root process (i.e. process 0) can be taken as an MPI_Send(). I am interpreting it as two loops of MPI_Send() whose destinations are the remaining processes. The destination processes also post MPI_Bcast() in thread ID 0 and thread ID 1, which can be taken as two MPI_Recv()'s posted by each process, one in each thread. Since the MPI_Bcast() calls are identical, there should be no matching problems in receiving the messages sent by process 0 (the root). Yet the program still does not work. Why? Is it because messages might get mixed up between different (or identical) collectives on the same communicator, and because MPI (MPICH2) sees this possibility, it simply does not allow two collectives on the same communicator to be pending at the same time?
Upvotes: 2
Views: 1958
Reputation: 74475
First of all, you are not checking the value of provided, which is where the MPI implementation returns the actually granted thread support level. The standard allows this level to be lower than the requested one, and a correct MPI application would rather do something like:
MPI_Init_thread(&argc, &argv, required, &provided);
if (provided < required)
{
    printf("Error: MPI does not provide the required thread support\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
    exit(1);
}
Second, this line of code is redundant:
MPI_Comm_rank(comm, &threadProcRank);
Threads in MPI do not have separate ranks - only processes do. There was a proposal to introduce so-called endpoints in MPI 3.0, which would have allowed a single process to have more than one rank and to bind them to different threads, but it didn't make it into the final version of the standard.
Third, you are using the same buffer variable in both collectives. I guess your intention was to use buffer1 in the call in thread 0 and buffer2 in the call in thread 1. Also, MPI_INTEGER is the datatype that corresponds to INTEGER in Fortran; for the C int type the corresponding MPI datatype is MPI_INT.
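With those two fixes, the calls would presumably read something like the sketch below. Note that this alone still leaves both collectives on the same communicator, which is the separate problem discussed further down:

if (threadID == 0)
    MPI_Bcast(buffer1, 10000, MPI_INT, 0, comm);   /* C int buffers use MPI_INT */
if (threadID == 1)
    MPI_Bcast(buffer2, 10000, MPI_INT, 0, comm);   /* second thread uses buffer2 */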
Fourth, the interpretation of MPI_BCAST as a loop of MPI_SEND and the corresponding MPI_RECV is just that - an interpretation. In reality the implementation is much different - see here. For example, with smaller messages, where the initial network setup latency is much higher than the physical data transmission time, binary and binomial trees are used in order to minimise the latency of the collective. Larger messages are usually broken into many segments and then a pipeline is used to pass the segments from the root rank to all the others. Even in the tree distribution case the messages could still be segmented.
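To give a rough feel for what a tree-based broadcast looks like, here is an illustrative (and deliberately simplified) binomial-tree broadcast built from point-to-point calls. It assumes the root is rank 0 and uses a made-up application-level tag; real implementations use reserved internal tags, handle arbitrary roots and apply many more optimisations:

#include <mpi.h>

/* Illustrative sketch only: binomial-tree broadcast from rank 0. */
static void tree_bcast_sketch(int *buf, int count, MPI_Comm comm)
{
    const int TAG = 4242;              /* made-up tag, for illustration only */
    int rank, size, mask = 1;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Non-root ranks first receive from their parent, i.e. the rank
       obtained by clearing the lowest set bit of their own rank. */
    while (mask < size)
    {
        if (rank & mask)
        {
            MPI_Recv(buf, count, MPI_INT, rank ^ mask, TAG, comm,
                     MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Then every rank forwards the data down to its children. */
    mask >>= 1;
    while (mask > 0)
    {
        if ((rank | mask) < size)
            MPI_Send(buf, count, MPI_INT, rank | mask, TAG, comm);
        mask >>= 1;
    }
}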
The catch is that in practice each collective operation is implemented using messages with the same tag, usually with negative tag values (which application programmers are not allowed to use). That means that both MPI_Bcast calls in your case would use the same tags to transmit their messages, and since the ranks are the same and the communicator is the same, the messages would get all mixed up. Hence the requirement that concurrent collectives be run only on separate communicators.
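One common way to satisfy that requirement - sketched here under the assumption of exactly two OpenMP threads per process and a library that really provides MPI_THREAD_MULTIPLE - is to give each thread its own duplicate of the communicator:

/* Sketch: one duplicated communicator per thread, so the two concurrent
   broadcasts can never have their internally tagged messages confused. */
MPI_Comm thread_comm[2];
MPI_Comm_dup(MPI_COMM_WORLD, &thread_comm[0]);
MPI_Comm_dup(MPI_COMM_WORLD, &thread_comm[1]);

#pragma omp parallel num_threads(2)
{
    int tid = omp_get_thread_num();
    MPI_Bcast(tid == 0 ? buffer1 : buffer2, 10000, MPI_INT, 0,
              thread_comm[tid]);
}

MPI_Comm_free(&thread_comm[0]);
MPI_Comm_free(&thread_comm[1]);

MPI_Comm_dup is itself a collective over the parent communicator, so it has to be called by all processes, here in the serial part of the code before the parallel region.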
There are two possible reasons why your program crashes. One is that the MPI library does not provide MPI_THREAD_MULTIPLE. The other is that the message gets split into two unevenly sized chunks, e.g. a larger first part and a smaller second part. The interference between the two collective calls could then cause the second thread to receive the large first chunk directed to the first thread while it is waiting for the smaller second chunk. The result would be message truncation, and the abort MPI error handler would get called. This usually does not result in a segfault and a core dump, so I would suppose that your MPICH2 is simply not compiled as thread-safe.
This is not MPICH2-specific. Open MPI and other implementations are also prone to the same limitations.
Upvotes: 4