Reputation: 2041
I've written this little MPI_Allreduce benchmark: bench_mpi.cxx.
It works well with Open MPI 1.8.4 and MPICH 1.4.1.
The results (one column for the number of processes and one column for the corresponding wall clock time) are here or here.
With MPICH 3.1.4, the wall clock time increases for 7, 8 or more processes: results are here.
In a real code (edit: a Computational Fluid Dynamics application), and with all three of the MPI implementations above, I observe the same problem for 7, 8 or more processes, while I expect my code to be scalable to at least 8 or 16 processes.
So I'm trying to understand what is happening with the little benchmark and MPICH 3.1.4.
Here is a zoomed-in view of the figure Rob Latham gives in his answer.
What is the code doing during the green rectangle? The MPI_Allreduce operation starts too late.
I've posted another question about much simpler code (just the time to execute MPI_Barrier).
Upvotes: 0
Views: 1501
Reputation: 5223
It's interesting you don't see this with OpenMPI or with earlier versions of MPICH, but the way your code is set up seems guaranteed to cause problems for any MPI collective.
You've given each process a variable amount of work to do. The problem with that is the introduction of "pseudo-synchronization" -- the time other MPI processes spend waiting for the laggard to catch up and participate in the collective.
With point-to-point messaging the costs are clear, and probably follow a LogP model.
Collectives have an additional cost: sometimes a process is blocked waiting for a participating process to send it some needed information. In Allgather, well, all the processes have a data dependency on one another.
When you have variable-sized work units, no process can make progress until the largest/slowest processor finishes.
If you instrument with MPE and display the trace in Jumpshot, it's easy to see this effect:
I've added red boxes for work (see https://gist.github.com/roblatham00/b981fc875d3852c6b63f), and the purple boxes are the default allgather color. The second iteration shows this most clearly: rank 0 spends almost no time in allgather. Ranks 2, 3, 4, and 5 have to wait for the slowpokes to catch up.
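To put rough numbers on that waiting, here is a toy model rather than anything measured from the benchmark. It assumes every rank enters a blocking collective right after finishing its own work, and that the collective itself costs nothing, so a rank's idle time is simply the slowest rank's work time minus its own. The function names idle_time_in_collective and total_idle are made up for this sketch, and it deliberately uses no MPI calls so it compiles anywhere.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model, not MPI: work_seconds[r] is the (assumed) time rank r spends on
// its own work before entering a blocking collective. Returns how long each
// rank sits idle waiting for the slowest rank to arrive. The cost of the
// collective itself is assumed to be zero.
std::vector<double> idle_time_in_collective(const std::vector<double>& work_seconds) {
    std::vector<double> idle(work_seconds.size(), 0.0);
    if (work_seconds.empty()) {
        return idle;
    }
    const double slowest =
        *std::max_element(work_seconds.begin(), work_seconds.end());
    for (std::size_t r = 0; r < work_seconds.size(); ++r) {
        idle[r] = slowest - work_seconds[r];  // the slowest rank waits 0
    }
    return idle;
}

// Total rank-seconds lost to waiting in one iteration under the same model.
double total_idle(const std::vector<double>& work_seconds) {
    double sum = 0.0;
    for (double t : idle_time_in_collective(work_seconds)) {
        sum += t;
    }
    return sum;
}
```

With work times of 1, 2 and 4 seconds, the model gives idle times of 3, 2 and 0 seconds: the rank with the most work spends almost no time in the collective, which is the pattern the trace shows for the laggard. Equal work times give zero idle time, which is why balancing the work matters more than tuning the collective.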
Upvotes: 2