Md Kamruzzaman Sarker

Reputation: 2407

TensorFlow MirroredStrategy and Horovod Distribution Strategy

I am trying to understand the basic differences between TensorFlow's MirroredStrategy and Horovod's distribution strategy.

From the documentation and from investigating the source code, I found that Horovod (https://github.com/horovod/horovod) uses the Message Passing Interface (MPI) to communicate between multiple nodes. Specifically, it uses MPI's allreduce and allgather operations.
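To make that concrete, here is a minimal sketch (not from the question, with an arbitrary toy model) of Horovod's usual data-parallel pattern, written against the TF1-era API that was current when this was asked:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # set up the MPI (or Gloo) communicator across all processes

# Toy model: a single trainable scalar, just so the script runs.
w = tf.Variable(1.0)
loss = tf.square(w - 3.0)

# Scale the learning rate by the number of workers, as Horovod's docs suggest.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
# DistributedOptimizer averages gradients across all ranks via allreduce.
train_op = hvd.DistributedOptimizer(opt).minimize(loss)

# Broadcast rank 0's initial weights so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    for _ in range(100):
        sess.run(train_op)
```

Each process is launched separately (e.g. with horovodrun -np 4 python train.py), and the only cross-process communication is the collective call hidden inside the wrapped optimizer.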

From my observation (I may be wrong), MirroredStrategy also uses an all-reduce algorithm (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).
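For comparison, a minimal sketch of the MirroredStrategy side (written against the current tf.distribute module; the link above points at the older tf.contrib.distribute location of the same code):

```python
import tensorflow as tf

# One process drives all local GPUs; each variable is mirrored onto every
# device, and per-step gradients are combined with an all-reduce.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="sgd", loss="mse")

# model.fit(...) then executes one synchronized step across all replicas.
```

Note the structural difference: Horovod runs one process per GPU and communicates between processes, while MirroredStrategy runs a single process that replicates the model across its local devices.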

Both of them use a data-parallel, synchronous training approach, so I am a bit confused about how they differ. Is the difference only in the implementation, or are there other (theoretical) differences?

And how does the performance of MirroredStrategy compare to Horovod's?

Upvotes: 7

Views: 1693

Answers (2)

Minh Nguyen

Reputation: 865

Regarding performance, one of my colleagues ran experiments on 4 Tesla V100 GPUs using the code from here. The results suggested that three settings work best: replicated with all_reduce_spec=nccl, collective_all_reduce with a properly tuned allreduce_merge_scope (e.g. 32), and horovod. I did not see significant differences among these three.
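For reference, the "replicated + nccl" setting can be approximated with the public tf.distribute API rather than the benchmark flags. This is only an illustrative sketch, and the num_packs value below is a stand-in for the merge-scope tuning mentioned above, not an exact equivalent:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(
    # NcclAllReduce corresponds to all_reduce_spec=nccl; num_packs batches
    # gradients before the collective runs, loosely analogous to tuning
    # allreduce_merge_scope. Both choices here are illustrative.
    cross_device_ops=tf.distribute.NcclAllReduce(num_packs=1))
```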

Upvotes: 0

Ashiq Imran

Reputation: 2281

MirroredStrategy has its own all-reduce algorithm, which uses remote procedure calls (gRPC) under the hood.

As you mentioned, Horovod uses MPI (or Gloo) to communicate between multiple processes.
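The difference shows up in the programming model too: Horovod exposes the collective itself, so you can call it directly. A tiny sketch (values are arbitrary; run with one process per GPU):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
local = tf.constant(float(hvd.rank()))
# allreduce averages (by default) the tensor across every process,
# going over MPI or Gloo depending on how Horovod was built and launched.
mean_over_ranks = hvd.allreduce(local)
```

In MirroredStrategy the equivalent reduction happens implicitly inside the strategy's cross-device ops, so there is no user-visible collective call.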

Upvotes: 0
