Chris P

Reputation: 185

Confirming message receipt with non-blocking communication in MPI

Premise: Many ranks need to send data to other ranks. The other ranks are

1) Usually a small fraction of the total communicator size

2) Unknown to the receiver

Consequently, a receiver doesn't know how many messages it will receive, or from where.

A possible solution is as follows:

Isend all messages

Busy wait:
    Probe & Recv messages
    When all Isends complete:
        start IBarrier
    Exit when IBarrier completes

By having each rank enter the barrier only once its own Isends have completed, the barrier finishes when the outbound messages of all ranks are "in the air".

Problem: Section 3.7.3 of the MPI 3.1 standard states that completion of a send via MPI_Test or MPI_Wait does not imply that the message has been received, only that the send buffer is free to be reused.

This apparently results in a race condition in which an Isend is left hanging with no corresponding Recv: the IBarrier can complete before the message is matched, so the receiving rank has already stopped listening.

Upvotes: 1

Views: 308

Answers (1)

Gilles

Reputation: 9489

From the way you describe your problem, it looks to me like there is no synchronization issue at the start. Therefore, here is the solution I propose:

int sendSizes[size]; // size of the messages I'll send to each process
// Populate sendSizes here
/* ........ */

int recvSizes[size]; // size of the messages I'll receive from each process
MPI_Alltoall( sendSizes, 1, MPI_INT, recvSizes, 1, MPI_INT, MPI_COMM_WORLD );

// Now I know exactly what to expect from each process.
// I could use an Irecv - Isend - Waitall approach,
// but an MPI_Alltoallv might be more appropriate
int sendDispls[size], recvDispls[size];
sendDispls[0] = recvDispls[0] = 0;
for ( int i = 0; i < size - 1; i++ ) {
    sendDispls[i+1] = sendDispls[i] + sendSizes[i];
    recvDispls[i+1] = recvDispls[i] + recvSizes[i];
}
int fullSendSize = sendDispls[size-1] + sendSizes[size-1];
double sendBuff[fullSendSize];
// Populate the sending buffer here
/* ........ */

int fullRecvSize = recvDispls[size-1] + recvSizes[size-1];
double recvBuff[fullRecvSize];
MPI_Alltoallv( sendBuff, sendSizes, sendDispls, MPI_DOUBLE,
               recvBuff, recvSizes, recvDispls, MPI_DOUBLE,
               MPI_COMM_WORLD );

As you can see, the cornerstone of the solution is to first use a call to MPI_Alltoall() to let every process know what to expect from every other process. This global communication shouldn't be an issue in terms of process synchronization, since all processes are supposed to be synchronized at the start.

Once this is done, the initial problem becomes trivial. It can be solved with an MPI_Irecv() loop, followed by an MPI_Isend() loop and a final MPI_Waitall(). However, I used a call to MPI_Alltoallv() here instead, just to show that it is possible too. In real life, the sending and receiving buffers are likely to exist already, and only the displacements will need to be computed to point to the right locations, sparing you unnecessary data copies.

But again, this part is now trivial, so it's up to you to see what's best in your code's context.

Upvotes: 1
