Reputation: 23
I have seen some open source code that uses MPI_Barrier before and after broadcasting the value from the root:
MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
I am not sure whether MPI_Bcast() is already blocking by nature. If that is the case, I may not need MPI_Barrier() to synchronize the progress of all the cores, and I could simply use:
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
So which one is correct?
Upvotes: 2
Views: 1598
Reputation: 74375
There is rarely a need for explicit synchronisation in MPI, and code like this makes little sense in general. Ranks in MPI mostly process data locally, share no access to global objects, and synchronise implicitly through the semantics of the send and receive operations. Why should rank i care whether some other rank j has already received the broadcast, when i is busy processing the received data locally?
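In other words, a plain broadcast with nothing around it is normally all you need. A minimal, self-contained sketch (the buffer size and the values here are placeholders of my own choosing):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float buffer[4] = {0.0f};
    if (rank == 0)
    {
        /* Only the root fills the buffer before the broadcast. */
        for (int i = 0; i < 4; i++)
            buffer[i] = (float)i;
    }

    /* No barrier needed: every rank returns from MPI_Bcast with the data
       in place and simply carries on with its local work. */
    MPI_Bcast(buffer, 4, MPI_FLOAT, 0, MPI_COMM_WORLD);

    printf("rank %d has buffer[3] = %f\n", rank, buffer[3]);

    MPI_Finalize();
    return 0;
}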
Explicit barriers are only needed in a handful of situations, and there are rare cases when code like the one you quote actually makes sense. Depending on the number of ranks, their distribution throughout the network of processing elements, the size of the data to broadcast, the latency and bandwidth of the interconnect, and the algorithm the MPI library uses to actually implement the data distribution, the broadcast may take much longer to complete when the ranks are even slightly out of alignment in time, due to the phenomenon of delay propagation, which may also affect the user code itself. Those are pathological cases and usually occur under specific conditions, which is why you may sometimes see code like:
#ifdef UNALIGNED_BCAST_IS_SLOW
MPI_Barrier(MPI_COMM_WORLD);
#endif
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
or even
if (config.unaligned_bcast_performance_remedy)
    MPI_Barrier(MPI_COMM_WORLD);
MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
I've seen at least one MPI-enabled quantum chemistry simulation software package include similar code.
That said, collective operations in MPI are not necessarily synchronising. The only one that guarantees a point in time at which all ranks are simultaneously inside the call is MPI_BARRIER. MPI allows ranks to leave a collective early, as soon as their own participation in it has finished. For example, MPI_BCAST may be implemented as a linear sequence of sends from the root:
/* Illustrative sketch of one possible internal implementation;
   buffer, count, type, root, and comm are the broadcast's arguments. */
int rank, size;
MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &size);

if (rank == root)
{
    /* The root sends the data to every other rank, one after another. */
    for (int i = 0; i < size; i++)
        if (i != rank)
            MPI_Send(buffer, count, type, i, SPECIAL_BCAST_TAG, comm);
}
else
{
    /* Each non-root rank receives the data once and is then done. */
    MPI_Recv(buffer, count, type, root, SPECIAL_BCAST_TAG, comm, MPI_STATUS_IGNORE);
}
In this case, rank 0 (if root is not 0) or rank 1 (when root is 0) will be the first one to receive the data, and since no further communication is directed to or from it, it can safely return from the broadcast call. If the data buffer is large and the interconnect is slow, this creates quite a bit of temporal staggering between the ranks.
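To actually observe that staggering, a rough sketch of a timing experiment (my own illustration, with an arbitrary buffer size) is to record MPI_Wtime() on each rank right after the broadcast returns and gather the timestamps on the root:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Large enough that the broadcast takes measurable time; the contents
       are irrelevant for the timing. */
    const int N = 1 << 24;
    float *buffer = malloc(N * sizeof(float));

    MPI_Barrier(MPI_COMM_WORLD);   /* align the start of the measurement */
    MPI_Bcast(buffer, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
    double t_exit = MPI_Wtime();   /* when this rank left the call */

    double *all_t = NULL;
    if (rank == 0)
        all_t = malloc(size * sizeof(double));
    MPI_Gather(&t_exit, 1, MPI_DOUBLE, all_t, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
    {
        /* Report exit times relative to the earliest rank to leave. */
        double t_min = all_t[0];
        for (int i = 1; i < size; i++)
            if (all_t[i] < t_min)
                t_min = all_t[i];
        for (int i = 0; i < size; i++)
            printf("rank %d left MPI_Bcast %.6f s after the first rank\n",
                   i, all_t[i] - t_min);
        free(all_t);
    }

    free(buffer);
    MPI_Finalize();
    return 0;
}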
Upvotes: 1