spektr
spektr

Reputation: 777

Questions about MPI Non-blocking details based on standard

So more recently, I have been developing some asynchronous algorithms in my research. I was doing some parallel performance studies and I have been suspicious that I am not properly understanding some details about the various non-blocking MPI functions.

I've seen some insightful posts on here, namely:

There's a few things I am uncertain about or just want to clarify related to working with non-blocking functionality that I think will help me potentially increase the performance of my current software.

From the Nonblocking Communication part of the MPI 3.0 standard:

A nonblocking send start call initiates the send operation, but does not complete it. The send start call can return before the message was copied out of the send buffer. A separate send complete call is needed to complete the communication, i.e., to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed.

...

If the send mode is standard then the send-complete call may return before a matching receive is posted, if the message is buffered. On the other hand, the receive-complete may not complete until a matching receive is posted, and the message was copied into the receive buffer.

So as a first set of questions about the MPI_Isend (and similarly MPI_Irecv), it seems as though to ensure a non-blocking send finishes, I need to use some mechanism to check that it is complete because in the worst case, there may not be suitable hardware to transfer the data concurrently, right? So if I never use something like MPI_Test or MPI_Wait following the non-blocking send, the MPI_Isend may never actually get its message out, right?

This question applies to some of my work because I am sending messages via MPI_Isend and not actually testing for completeness until I get the expected response message because I want to avoid the overhead of MPI_Test calls. While this approach has been working, it seems faulty based on my reading.

Further, the second paragraph appears to say that for the standard non-blocking send, MPI_Isend, it may not even begin to send any of its data until the destination process has called a matching receive. Given the availability of MPI_Probe/MPI_Iprobe, does this mean an MPI_Isend call will at least send out some preliminary metadata of the message, such as size, source, and tag, so that the probe functions on the destination process can know a message wants to be sent there and so the destination process can actually post a corresponding receive?

Related is a question about the probe. In the Probe and Cancel section, the standard says that

MPI_IPROBE(source, tag, comm, flag, status) returns flag = true if there is a message that can be received and that matches the pattern speci fed by the arguments source, tag, and comm. The call matches the same message that would have been received by a call to MPI_RECV(..., source, tag, comm, status) executed at the same point in the program, and returns in status the same value that would have been returned by MPI_RECV(). Otherwise, the call returns flag = false, and leaves status undefi ned.

Going off of the above passage, it is clear the probing will tell you whether there's an available message you can receive corresponding to the specified source, tag, and comm. My question is, should you assume that the data for the corresponding send from a successful probing has not actually been transferred yet?

It seems reasonable to me now, after reading the standard, that indeed a message the probe is aware of need not be a message that the local process has actually fully received. Given the previous details about the standard non-blocking send, it seems you would need to post a receive after doing the probing to ensure the source non-blocking standard send will complete, because there might be times where the source is sending a large message that MPI does not want to copy into some internal buffer, right? And either way, it seems that posting the receive after a probing is how you ensure that you actually get the full data from the corresponding send to be sent. Is this correct?

This latter question relates to one instance in my code where I am doing a MPI_Iprobe call and if it succeeds, I perform an MPI_Recv call to get the message. However, I think this could be problematic now because I was thinking in my mind that if the probe succeeds, that means it has gotten the whole message already. This implied to me that the MPI_Recv would run quickly, then, since the full message would already be in local memory somewhere. However, I am feeling this was an incorrect assumption now that some clarification on would be helpful.

Upvotes: 2

Views: 794

Answers (1)

Gilles Gouaillardet
Gilles Gouaillardet

Reputation: 8395

The MPI standard does not mandate a progress thread. That means that MPI_Isend() might do nothing at all until communications are progressed. Progress occurs under the hood by most MPI subroutines, MPI_Test(), MPI_Wait() and MPI_Probe() are the most obvious ones.

I am afraid you are mixing progress and synchronous send (e.g. MPI_Ssend()).

MPI_Probe() is a local operation, it means it will not contact the sender and ask if something was sent nor progress it.

Performance wise, you should as much as possible avoid unexpected messages, it means a receive should be posted on one end before the message is sent by the other end.

There is a trade-off between performance and portability here :

  • if you want to write portable code, then you cannot assume there is a MPI progress thread
  • if you want to optimize your application on a given system, you should give a try to a MPI library that implements a progress thread on the interconnect you are using

Keep in mind most MPI implementations (read this is not mandated by the MPI standard, and you should not rely on it) send small messages in eager mode. It means MPI_Send() will likely return immediately if the message is small enough (and small enough depends among other things on your MPI implementation, how it is tuned or which interconnect is used).

Upvotes: 4

Related Questions