bxshi

Reputation: 2292

MPI cannot send data to itself with MPI_Send and MPI_Recv

I'm trying to implement MPI_Bcast, and I'm planning to do it with MPI_Send and MPI_Recv, but it seems I cannot send a message to myself?

The code is as follows:

void My_MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
    int comm_rank, comm_size, i;
    MPI_Comm_rank(comm, &comm_rank);
    MPI_Comm_size(comm, &comm_size);
    if (comm_rank == root) {
        for (i = 0; i < comm_size; i++) {
            MPI_Send(buffer, count, datatype, i, 0, comm);
        }
    }
    MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
}

Any suggestions? Or should I never send a message to myself and just do a memory copy instead?

Upvotes: 2

Views: 9398

Answers (4)

Hristo Iliev

Reputation: 74355

Your program is erroneous on multiple levels. First of all, there is an error in the conditional:

if(comm_rank=root){

This does not compare comm_rank to root but rather assigns root to comm_rank; the loop would then execute only if root is non-zero, and besides, it would be executed by all ranks, not just the root.

Second, the root process does not need to send data to itself since the data is already there. Even if you'd like to send and receive anyway, notice that both MPI_Send and MPI_Recv use the same buffer, which is not correct. Some MPI implementations use direct memory copy for self-interaction, i.e. the library might use memcpy() to transfer the message, and calling memcpy() with overlapping buffers (including using the same buffer) leads to undefined behaviour.

The proper way to implement a linear broadcast is:

void My_MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
{
   int comm_rank, comm_size, i;
   MPI_Comm_rank(comm, &comm_rank);
   MPI_Comm_size(comm, &comm_size);
   if (comm_rank == root)
   {
      for (i = 0; i < comm_size; i++)
      {
         if (i != comm_rank)
            MPI_Send(buffer, count, datatype, i, 0, comm);
      }
   }
   else
      MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
}

The usual ways for a process to talk to itself without deadlocking are:

  • using a combination of MPI_Isend and MPI_Recv or a combination of MPI_Send and MPI_Irecv;
  • using buffered send MPI_Bsend (a sketch follows this list);
  • using MPI_Sendrecv or MPI_Sendrecv_replace.
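
For illustration, here is a minimal sketch of the MPI_Bsend variant, reusing the buffer, count, datatype, comm and comm_rank names from the question; the separate recv_buffer and the attach-buffer sizing are illustrative assumptions, not part of the original code:

// Attach a buffer large enough for one message plus the bsend overhead
int pack_size, attach_size;
char *attach_buf;
MPI_Pack_size(count, datatype, comm, &pack_size);
attach_size = pack_size + MPI_BSEND_OVERHEAD;
attach_buf = malloc(attach_size);
MPI_Buffer_attach(attach_buf, attach_size);

// MPI_Bsend completes locally, so posting the matching receive afterwards
// cannot deadlock even though source == destination == comm_rank
MPI_Bsend(buffer, count, datatype, comm_rank, 0, comm);
MPI_Recv(recv_buffer, count, datatype, comm_rank, 0, comm, MPI_STATUS_IGNORE);

MPI_Buffer_detach(&attach_buf, &attach_size);
free(attach_buf);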

The combination of MPI_Irecv and MPI_Send works well in cases where multiple sends are done in a loop like yours. For example:

MPI_Request req;

// Start a non-blocking receive
MPI_Irecv(buff2, count, datatype, root, 0, comm, &req);
// Send to everyone
for (i = 0; i < comm_size; i++)
   MPI_Send(buff1, count, datatype, i, 0, comm);
// Complete the non-blocking receive
MPI_Wait(&req, MPI_STATUS_IGNORE);

Note the use of separate buffers for send and receive. Probably the only point-to-point MPI communication call that allows the same buffer to be used both for sending and receiving is MPI_Sendrecv_replace, along with the in-place modes of the collective MPI calls. But these are implemented internally in such a way that at no time is the same memory area used both for sending and receiving.
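
As an illustration of that last point, a minimal sketch of a self-exchange with MPI_Sendrecv_replace, reusing the buffer, count, datatype, comm and comm_rank names from above:

// MPI_Sendrecv_replace may legally use one buffer for both directions;
// here the process sends to and receives from itself
MPI_Sendrecv_replace(buffer, count, datatype,
                     comm_rank, 0,  /* destination and send tag */
                     comm_rank, 0,  /* source and receive tag */
                     comm, MPI_STATUS_IGNORE);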

Upvotes: 6

Stan Graves

Reputation: 6955

The MPI_Send / MPI_Recv pair on the root rank can deadlock.

Converting to MPI_Isend could resolve the issue. However, there may be problems because the send buffer is being reused, and the root is VERY likely to reach the MPI_Recv "early" and may then alter that buffer before it has been transmitted to the other ranks. This is especially likely on large jobs. Also, if this routine is ever called from Fortran, there could be issues with the buffer being corrupted on each MPI_Send call.
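
A minimal sketch of that reading, reusing the names from the question plus a hypothetical separate recv_buffer; the point is that the root's send buffer may only be reused after MPI_Waitall:

// Root posts non-blocking sends to every rank (itself included), receives its
// own copy, and only touches the send buffer again after all sends complete
MPI_Request *reqs = malloc(comm_size * sizeof(MPI_Request));
for (i = 0; i < comm_size; i++)
    MPI_Isend(buffer, count, datatype, i, 0, comm, &reqs[i]);
MPI_Recv(recv_buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
MPI_Waitall(comm_size, reqs, MPI_STATUSES_IGNORE);
free(reqs);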

MPI_Sendrecv could be used for the root process only. That would allow the MPI_Sends to all non-root ranks to "complete" (i.e. the send buffer can be safely altered) before the root process enters a dedicated MPI_Sendrecv. The for loop would simply begin at "1" instead of "0", and the MPI_Sendrecv call would be added at the bottom of that loop. ("Why" is a better question, since the data is in "buffer" and is going to "buffer".)
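
A sketch of that arrangement, assuming root == 0 as the answer implies, with a hypothetical scratch buffer because MPI_Sendrecv does not permit overlapping send and receive buffers:

if (comm_rank == root) {
    // Sends to all non-root ranks can complete before the self send-receive
    for (i = 1; i < comm_size; i++)
        MPI_Send(buffer, count, datatype, i, 0, comm);
    // Dedicated send-receive of the root with itself, into a scratch buffer
    MPI_Sendrecv(buffer, count, datatype, root, 0,
                 scratch, count, datatype, root, 0,
                 comm, MPI_STATUS_IGNORE);
} else {
    MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
}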

However, all this begs the question: why are you doing this at all? If this is a simple "academic exercise" in writing a collective with point-to-point calls, so be it. BUT, your approach is naive at best. This overall strategy would be beaten by any of the MPI_Bcast algorithms in any reasonably implemented MPI.

Upvotes: 1

kraffenetti

Reputation: 365

This is an incorrect program. You cannot rely on doing a blocking MPI_Send to yourself...because it may block. MPI does not guarantee that your MPI_Send returns until the buffer is available again. In some cases this could mean it will block until the message has been received by the destination. In your program, the destination may never call MPI_Recv, because it is still trying to send.

Now, in your My_MPI_Bcast example, the root process already has the data. Why does it need to send or copy it at all?

Upvotes: 2

Sleepyhead

Reputation: 1021

I think you should call MPI_Recv(buffer, count, datatype, root, 0, comm, MPI_STATUS_IGNORE); only on the ranks other than root; otherwise it will probably hang.

Upvotes: -1
