Reputation: 1098
I want to experiment with TensorFlow's in-graph replication in a multi-GPU cluster with multiple ps and worker tasks. The CIFAR-10 multi-GPU example shows in-graph synchronous replication on a single machine. Is there an example I can refer to, like the example trainer program for between-graph training?
Upvotes: 2
Views: 2448
Reputation: 126154
Generally speaking, we prefer between-graph replication over in-graph replication for distributed training, because between-graph replication is more scalable than (the current implementation of) in-graph replication. The main problem with in-graph replication is that it currently requires you to build multiple copies of the graph structure for your network and materialize them at a single location (i.e. the distributed master). When you have hundreds of replicas, this turns the master into a bottleneck; by contrast, in between-graph replication each replica only has a copy of the graph that runs locally.
The downside of between-graph replication is that it makes synchronous training more difficult, because you now have multiple training loops to synchronize, rather than a single loop with a single training op. The tf.train.SyncReplicasOptimizer used in the distributed Inception trainer provides one way to do synchronous training with between-graph replication.
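To make that concrete, here is a minimal sketch of wrapping an optimizer in tf.train.SyncReplicasOptimizer so that gradients from all replicas are aggregated before a single update is applied. The model, the number of workers, and the is_chief value are illustrative placeholders; I've used the tf.compat.v1 names so the graph builds under both TF 1.x and 2.x:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

num_workers = 4  # assumed number of worker replicas (placeholder)

# Toy model; in practice this is your network's loss.
x = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.get_variable("w", shape=[1], initializer=tf.zeros_initializer())
loss = tf.reduce_mean(tf.square(x * w - 1.0))

global_step = tf.train.get_or_create_global_step()
base_opt = tf.train.GradientDescentOptimizer(0.1)

# Wrap the optimizer: each worker's gradients go into shared
# accumulators, and one averaged update is applied per step.
opt = tf.train.SyncReplicasOptimizer(
    base_opt,
    replicas_to_aggregate=num_workers,
    total_num_replicas=num_workers)
train_op = opt.minimize(loss, global_step=global_step)

# Each worker passes this hook to its MonitoredTrainingSession;
# exactly one task (the chief) should pass is_chief=True.
sync_hook = opt.make_session_run_hook(is_chief=True)
```

Each worker then runs its own training loop, and the hook coordinates the synchronization across them.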
However, if you want to try in-graph replication, you can do it by modifying the line that assigns a device to each of the towers in the CIFAR-10 example. Instead of assigning the towers to different GPUs in the same process, you can assign them to GPUs in different worker tasks. For example:
worker_devices = ["/job:worker/task:0/gpu:0", ..., "/job:worker/task:7/gpu:0"]
for worker_device in worker_devices:
  with tf.device(worker_device):
    # Execute code for building the model replica.
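A slightly fuller sketch of that pattern, with a toy replica body and a single master-side loss (the device strings, variable names, and averaging are illustrative, not taken from the CIFAR-10 example; tf.compat.v1 names are used so it builds under both TF 1.x and 2.x):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Hypothetical worker devices; in a real cluster these come from the
# ClusterSpec you launch the jobs with.
worker_devices = ["/job:worker/task:%d/gpu:0" % i for i in range(2)]

tower_losses = []
for i, worker_device in enumerate(worker_devices):
    with tf.device(worker_device):
        # Share the variables across towers by reusing the scope
        # after the first replica has created them.
        with tf.variable_scope("model", reuse=(i > 0)):
            w = tf.get_variable(
                "w", shape=[1], initializer=tf.ones_initializer())
            tower_losses.append(tf.reduce_sum(tf.square(w)))

# A single training loop on the master optimizes the mean tower loss.
total_loss = tf.add_n(tower_losses) / len(tower_losses)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(total_loss)
```

Because the whole graph lives in one client, one session.run(train_op) drives all the replicas, which is what makes the training synchronous, but also what makes the master a bottleneck at scale.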
Upvotes: 9