Piyush Shrivastava

Reputation: 1098

TensorFlow in-graph replication example

I want to experiment with TensorFlow's in-graph replication on a multi-GPU cluster with multiple parameter server (ps) and worker tasks. The CIFAR-10 multi-GPU example shows in-graph synchronous replication on a single machine. Is there an example I can refer to, analogous to the example trainer program for between-graph training?

Upvotes: 2

Views: 2448

Answers (1)

mrry

Reputation: 126154

Generally speaking, we prefer between-graph replication over in-graph replication for distributed training, because between-graph replication is more scalable than (the current implementation of) in-graph replication. The main problem with in-graph replication is that it currently requires you to build multiple copies of the graph structure for your network and materialize them in a single location (i.e. the distributed master). When you have hundreds of replicas, this turns the master into a bottleneck; by contrast, in between-graph replication each replica only has a copy of the graph that runs locally.

The downside of between-graph replication is that it makes synchronous training more difficult, because you now have multiple training loops to synchronize, rather than a single loop with a single training op. The tf.train.SyncReplicasOptimizer used in the distributed Inception trainer provides one way to do synchronous training with between-graph replication.
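
For concreteness, here is a minimal sketch of how tf.train.SyncReplicasOptimizer might be wired into a between-graph trainer. It assumes a TF 1.x-style API; the loss, num_workers, and is_chief arguments are placeholders for whatever your own trainer defines:

import tensorflow as tf

def make_sync_train_op(loss, num_workers, is_chief, learning_rate=0.01):
  """Wrap a plain optimizer so that gradients from `num_workers`
  between-graph replicas are aggregated before each variable update."""
  global_step = tf.train.get_or_create_global_step()
  base_opt = tf.train.GradientDescentOptimizer(learning_rate)
  opt = tf.train.SyncReplicasOptimizer(
      base_opt,
      replicas_to_aggregate=num_workers,
      total_num_replicas=num_workers)
  train_op = opt.minimize(loss, global_step=global_step)
  # The hook manages the token queue that keeps the replicas in lock-step.
  sync_hook = opt.make_session_run_hook(is_chief)
  return train_op, sync_hook

Each worker would then pass sync_hook to its tf.train.MonitoredTrainingSession (with is_chief set appropriately) around its local training loop, so that a variable update is only applied once gradients from replicas_to_aggregate replicas have been collected.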

However, if you want to try in-graph replication, you could do it by modifying the line that assigns a device to each of the towers in the CIFAR-10 example. Instead of assigning the towers to different GPUs in the same process, you can assign them to GPUs in different worker tasks. For example:

worker_devices = ["/job:worker/task:0/gpu:0", ..., "/job:worker/task:7/gpu:0"]

for worker_device in worker_devices:
  with tf.device(worker_device):
    # Execute code for building the model replica.
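
Expanding that pseudocode into something closer to the structure of the CIFAR-10 trainer, a rough in-graph sketch might look like the following. The toy tower_loss function and the fixed 8-task cluster layout are assumptions for illustration, not part of the original example:

import tensorflow as tf

# One client builds a tower per remote worker GPU and averages their
# gradients into a single train op (TF 1.x-style API).
worker_devices = ["/job:worker/task:%d/gpu:0" % i for i in range(8)]

def tower_loss():
  # Stand-in for the real inference + loss graph of your model.
  w = tf.get_variable("w", initializer=1.0)
  return tf.square(w - 3.0)

opt = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
  for i, worker_device in enumerate(worker_devices):
    with tf.device(worker_device), tf.name_scope("tower_%d" % i):
      tower_grads.append(opt.compute_gradients(tower_loss()))
      # Every tower after the first reuses the same model variables.
      tf.get_variable_scope().reuse_variables()

# Average the per-tower gradients and apply them once, as the
# CIFAR-10 multi-GPU trainer does.
averaged = []
for grads_and_vars in zip(*tower_grads):
  grads = tf.stack([g for g, _ in grads_and_vars])
  averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
train_op = opt.apply_gradients(averaged)

In a real cluster you would also pin the variables to your ps tasks (for example with tf.train.replica_device_setter or an explicit device scope) rather than letting them land on the first worker's device, so that every tower reads and updates the same parameter copies.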

Upvotes: 9
