Reputation: 185
Premise: Many ranks need to send data to other ranks. The other ranks are
1) Usually a small fraction of the total communicator size
2) Unknown to the receiver
Consequently, a receiver doesn't know how many messages it will receive, or from whom.
A possible solution is as follows:

    Isend all messages
    busy-wait:
        Probe & Recv incoming messages
        when all my Isends have completed, start an IBarrier
        exit when the IBarrier completes
Since each rank only reaches the barrier once its own Isends have completed, the barrier completes only when the outbound messages on all ranks are "in the air".
Problem: Section 3.7.3 of the MPI 3.1 standard states that completion of a send with MPI_Test or MPI_Wait implies only that the send buffer is free to be reused; it does not imply that the message has been received.
This apparently results in a race condition: an Isend may still be in flight with no matching receive posted, because the IBarrier can complete before the message is matched, at which point the receiving rank has stopped listening.
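To make the pattern concrete, here is a minimal sketch of the scheme above, with the problematic spot marked. Names like nSends, dests, counts and msgs are hypothetical placeholders for the application's own bookkeeping, not part of any real API:

MPI_Request sendReqs[nSends];
for ( int i = 0; i < nSends; i++ )
    MPI_Isend( msgs[i], counts[i], MPI_DOUBLE, dests[i], 0,
               MPI_COMM_WORLD, &sendReqs[i] );

MPI_Request barrier = MPI_REQUEST_NULL;
int barrierDone = 0;
while ( !barrierDone ) {
    // Drain any message that has already arrived
    int flag;
    MPI_Status status;
    MPI_Iprobe( MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &status );
    if ( flag ) {
        int count;
        MPI_Get_count( &status, MPI_DOUBLE, &count );
        double buff[count];
        MPI_Recv( buff, count, MPI_DOUBLE, status.MPI_SOURCE, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE );
        // ... process the received message here ...
    }
    if ( barrier == MPI_REQUEST_NULL ) {
        int allSent;
        MPI_Testall( nSends, sendReqs, &allSent, MPI_STATUSES_IGNORE );
        if ( allSent )
            // RACE: completion of the Isends only means the send buffers
            // are reusable, not that the messages have been matched
            MPI_Ibarrier( MPI_COMM_WORLD, &barrier );
    } else {
        MPI_Test( &barrier, &barrierDone, MPI_STATUS_IGNORE );
    }
}

The comment marks the gap: the IBarrier can complete on all ranks while a message whose Isend has "completed" is still unmatched.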
Upvotes: 1
Views: 308
Reputation: 9489
From the way you describe your problem, it looks to me like there is no synchronization issue at the start. Therefore, here is the solution I propose:
int size;
MPI_Comm_size( MPI_COMM_WORLD, &size ); // number of processes in the communicator

int sendSizes[size]; // sizes of the messages I'll send to each process
// Populate sendSizes here
/* ........ */
int recvSizes[size]; // sizes of the messages I'll receive from each process
MPI_Alltoall( sendSizes, 1, MPI_INT, recvSizes, 1, MPI_INT, MPI_COMM_WORLD );
// Now I know exactly what to expect from each process
// I could use an Irecv - Isend - Waitall approach,
// but an MPI_Alltoallv might be more appropriate
// Compute the offset of each process's chunk in the send and receive buffers
int sendDispls[size], recvDispls[size];
sendDispls[0] = recvDispls[0] = 0;
for ( int i = 0; i < size - 1; i++ ) {
    sendDispls[i+1] = sendDispls[i] + sendSizes[i];
    recvDispls[i+1] = recvDispls[i] + recvSizes[i];
}
int fullSendSize = sendDispls[size-1] + sendSizes[size-1];
double sendBuff[fullSendSize];
// Populate the sending buffer here
/* ........ */
int fullRecvSize = recvDispls[size-1] + recvSizes[size-1];
double recvBuff[fullRecvSize];
MPI_Alltoallv( sendBuff, sendSizes, sendDispls, MPI_DOUBLE,
recvBuff, recvSizes, recvDispls, MPI_DOUBLE,
MPI_COMM_WORLD );
As you can see, the cornerstone of the solution is to first use a call to MPI_Alltoall() to let every process know what to expect from every other process. This global communication shouldn't be an issue in terms of process synchronization, since all processes are supposed to be synchronized at the start.
Once this is done, the initial problem becomes trivial: it can be solved with an MPI_Irecv() loop, followed by an MPI_Isend() loop and a final MPI_Waitall(). However, I used a call to MPI_Alltoallv() here to do the same thing, just to show that it is possible too. In real life, the send and receive buffers are likely to exist already, and only the displacements will need to be computed to point to the right locations, sparing you unnecessary data copies.
But again, this part is now trivial, so it's up to you to see what's best in your code's context.
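In case it helps, a minimal sketch of that Irecv - Isend - Waitall variant, reusing the buffers and displacements computed above, could look like this:

// Point-to-point alternative to MPI_Alltoallv(); zero-sized transfers are
// simply posted for pairs of processes that exchange nothing
MPI_Request reqs[2*size];
for ( int i = 0; i < size; i++ )
    MPI_Irecv( recvBuff + recvDispls[i], recvSizes[i], MPI_DOUBLE,
               i, 0, MPI_COMM_WORLD, &reqs[i] );
for ( int i = 0; i < size; i++ )
    MPI_Isend( sendBuff + sendDispls[i], sendSizes[i], MPI_DOUBLE,
               i, 0, MPI_COMM_WORLD, &reqs[size+i] );
MPI_Waitall( 2*size, reqs, MPI_STATUSES_IGNORE );

Since the MPI_Alltoall() handshake made the sizes known on both sides, even the zero-sized exchanges match up, and no probing is needed.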
Upvotes: 1