jason

Reputation: 23

Barrier before MPI_Bcast()?

I have seen some open-source code use MPI_Barrier before and after broadcasting the root value:

MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);

I am not sure whether MPI_Bcast() is already naturally blocking/synchronizing. If it is, I may not need MPI_Barrier() to synchronize the progress of all the ranks, and I could simply use:

MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);

So which one is correct?

Upvotes: 2

Views: 1598

Answers (1)

Hristo Iliev

Reputation: 74375

There is rarely a need to perform explicit synchronisation in MPI, and code like that makes little sense in general. Ranks in MPI mostly process data locally, share no access to global objects, and synchronise implicitly following the semantics of the send and receive operations. Why should rank i care whether some other rank j has received the broadcast when i is processing the received data locally?

Explicit barriers are generally needed in the following situations:

  • benchmarking - a barrier before a timed region of the code removes any extraneous waiting times resulting from one or more ranks being late to the party (see the sketch after this list)
  • parallel I/O - in this case, there is a global object (a shared file) and the consistency of its content may depend on the proper order of I/O operations, hence the need for explicit synchronisation
  • one-sided operations (RMA) - similarly to the parallel I/O case, some RMA scenarios require explicit synchronisation
  • shared-memory windows - a subset of RMA in which memory shared between several ranks is accessed with direct load and store instructions instead of MPI calls; this brings all the problems inherent to shared-memory programming into MPI, such as the possibility of data races, and with them the need for locks and barriers
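
To make the first (benchmarking) case concrete, here is a minimal sketch of the usual pattern, assuming MPI has already been initialised; compute_step() is a hypothetical placeholder for whatever code is being timed, and the final reduction merely reports the time of the slowest rank:

MPI_Barrier(MPI_COMM_WORLD);             /* align all ranks before timing */
double t_start = MPI_Wtime();            /* start the local timer */

compute_step();                          /* hypothetical timed work */

double t_local = MPI_Wtime() - t_start;  /* local elapsed time */
double t_max;
/* the slowest rank determines the overall runtime */
MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);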

There are rare cases when the code actually makes sense. Depending on the number of ranks, their distribution throughout the network of processing elements, the size of the data to broadcast, the latency and bandwidth of the interconnect, and the algorithm the MPI library uses to actually implement the data distribution, the broadcast may take much longer to complete when the ranks are ever so slightly out of alignment in time, due to the phenomenon of delay propagation (which may also apply to the user code itself). Those are pathological cases that usually occur only under specific conditions, which is why you may sometimes see code like:

#ifdef UNALIGNED_BCAST_IS_SLOW
MPI_Barrier(MPI_COMM_WORLD);
#endif
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);

or even

if (config.unaligned_bcast_performance_remedy)
  MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);

I've seen at least one MPI-enabled quantum chemistry simulation software package include similar code.

That said, collective operations in MPI are not necessarily synchronising. The only one to guarantee that there is a point in time where all ranks are simultaneously inside the call is MPI_BARRIER. MPI allows ranks to exit early once their participation in the collective operation has finished. For example, MPI_BCAST may be implemented as a linear sequence of sends from the root:

/* A naive linear broadcast: the root sends the data to each other rank in
   turn, while every non-root rank posts a single receive and may return as
   soon as it completes. */
int rank, size;

MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &size);

if (rank == root)
{
   // the root loops over all other ranks and sends them the buffer
   for (int i = 0; i < size; i++)
      if (i != rank)
         MPI_Send(buffer, count, type, i, SPECIAL_BCAST_TAG, comm);
}
else
{
   // non-root ranks receive once and are then done with the broadcast
   MPI_Recv(buffer, count, type, root, SPECIAL_BCAST_TAG, comm, MPI_STATUS_IGNORE);
}

In this case, rank 0 (if the root is not 0) or rank 1 (if the root is 0) will be the first to receive the data and, since no further communication is directed to or from it, it can safely return from the broadcast call. If the data buffer is large and the interconnect is slow, this creates quite some temporal staggering between the ranks.
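
One way to observe that staggering is simply to time the call on each rank. Here is a minimal self-contained example; the buffer size of 1 Mi floats is arbitrary, and the printed per-rank times will typically differ depending on the broadcast algorithm the library picks:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   const int N = 1 << 20;   /* arbitrary message size: 1 Mi floats */
   int rank;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   float *buffer = malloc(N * sizeof(float));
   if (rank == 0)
      for (int i = 0; i < N; i++)
         buffer[i] = (float)i;   /* only the root's values matter */

   double t_start = MPI_Wtime();
   MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
   double t_elapsed = MPI_Wtime() - t_start;

   /* each rank reports how long it spent inside the broadcast */
   printf("rank %d spent %.6f s in MPI_Bcast\n", rank, t_elapsed);

   free(buffer);
   MPI_Finalize();
   return 0;
}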

Upvotes: 1
