Reputation: 15
I know there are a ton of questions & answers about the different modes of MPI send and receive out there, but I believe mine is different or I am simply not able to apply these answers to my problem.
Anyway, my scenario is as follows. The code is intended for high-performance clusters with potentially thousands of cores, organized into a multi-dimensional grid. In my algorithm, there are two successive operations that need to be carried out, let's call them A and B, where A precedes B. A brief description is as follows:
A: Each processor has multiple buffers. It has to send each of these buffers to a certain set of processors. For each buffer to be sent, the set of receiving processors might differ. Sending is the last step of operation A.
B: Each processor receives a set of buffers from a set of processors. Operation B will then work on these buffers once it has received all of them. The result of that operation will be stored in a fixed location (neither in the send nor the receive buffers).
The following properties are also given:
To send and receive, both A and B have to use (nested) loops of their own to send and receive the different buffers. However, we cannot make any assumption about the order of these send and receive statements, i.e. for any two buffers buf0 and buf1, we cannot guarantee that if buf0 is received by some processor before buf1, then buf0 was also sent before buf1. Note that at this point, using collective operations like MPI_Bcast etc. is not an option yet due to the complexity of determining the sets of receiving/sending processors.
Question: Which send and receive modes should I use? I have read a lot of different material about these modes, but I cannot really wrap my head around them. The most important property is deadlock-freedom, and the next most important is performance. I am leaning towards using MPI_Isend() in A without checking the request status, likewise using the non-blocking MPI_Irecv() in B's loops, and then using MPI_Waitall() to ensure that all buffers have been received (and, as a consequence, that all buffers have been sent and the processors are synchronized).
Is this the correct approach, or do I have to use buffered sends or something entirely different? I don't have a ton of experience in MPI and the documentation does not really help me much either.
Upvotes: 0
Views: 491
Reputation: 897
From how you describe your problem, I think MPI_Isend is likely to be the best (only?) option for A, because it is guaranteed not to block, whereas MPI_Send may return without blocking, but only if the implementation is able to buffer your message internally.
You should then be able to use an MPI_Barrier so that all processes enter B at the same time. But this might be a performance hit: if you don't insist that all processes enter B at the same time, some can begin receiving messages sooner. Furthermore, given your disjoint send and receive buffers, this should be safe.
For B, you can use MPI_Irecv or MPI_Recv. MPI_Irecv is likely to be faster, because a blocking MPI_Recv might sit waiting for a slow sender while messages from other processes are already available.
Whether or not you block on the receiving end, you ought to call MPI_Waitall on all outstanding requests before finishing the loop, to ensure that all send/recv operations have completed successfully.
An additional point: you could leverage MPI_ANY_SOURCE with MPI_Recv to receive messages in a blocking manner and operate on them immediately, in whatever order they arrive. However, given that you specified that no operations on the data happen until all data are received, this might not be that useful.
Finally: as mentioned in these recommendations, you will get the best performance if you can restructure your code so that you can just use MPI_Ssend. In that case you avoid any buffering at all. To achieve this, you'd have all processes first post their MPI_Irecv calls, then begin sending via MPI_Ssend. It might not be as hard as you think to refactor in this way, particularly if, as you say, each process can work out independently which messages it will receive from whom.
Upvotes: 1