Ujjwal

Reputation: 1859

Understanding the conceptual basics of Distributed TensorFlow

I have gone through the Distributed TensorFlow documentation, but there are some functional basics I could not understand properly, hence this question.

Let me describe the cluster setup first. Consider the following situation: I want to use Distributed TensorFlow to train a model.

  1. How do I specify network addresses to tf.train.ClusterSpec? What are those network addresses? In the documentation, are names such as localhost:2222 reserved for a particular node by the cluster manager? (See the sketch after this list.)
  2. My data is copied to node A. During training, will TensorFlow itself be responsible for sending this data as input to the GPU that is on node B?
  3. Will I need to manually create the TensorFlow graph for each GPU on each node using tf.device()?
  4. If I also want to use some additional CPU nodes, will I have to know their names beforehand and put them in the code?
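For context, here is a minimal sketch (TensorFlow 1.x API) of the kind of cluster spec I am asking about. The hostnames, ports, and job layout are placeholders I made up, not names reserved by TensorFlow:

```python
import tensorflow as tf

# Hypothetical cluster: the addresses are just host:port pairs that the
# machines can reach each other on; nothing here is predefined.
cluster = tf.train.ClusterSpec({
    "ps":     ["node-a.example.com:2222"],     # parameter server
    "worker": ["node-b.example.com:2222",      # GPU worker
               "node-c.example.com:2222"],     # another worker
})

# Each machine in the cluster runs one tf.train.Server and identifies
# itself by job name and task index (normally passed in via flags).
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # block and serve requests from clients
```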

Upvotes: 0

Views: 367

Answers (1)

Yaroslav Bulatov

Reputation: 57893

  1. Yes.
  2. Your client creates the graph and executes this graph on the worker. If you use between-graph replication, as in the how-to with a parameter server, your client and worker are the same process. This process only needs to create the part of the graph for the current node, using with tf.device. If you use within-graph replication with a single client, your client needs to create the graph for all nodes, using multiple with tf.device sections.
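A minimal sketch of the between-graph pattern described in point 2 (TensorFlow 1.x API). The hostnames, the toy model, and the hard-coded task_index are placeholders; in a real script, job_name and task_index would come from command-line flags:

```python
import tensorflow as tf

# Every worker process runs this same script with its own task_index.
cluster = tf.train.ClusterSpec({
    "ps": ["node-a.example.com:2222"],
    "worker": ["node-b.example.com:2222", "node-c.example.com:2222"],
})
task_index = 0  # this worker's index (placeholder)
server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

# replica_device_setter places variables on the ps job and everything else
# on this worker's device, so each process only builds its part of the graph.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    x = tf.placeholder(tf.float32, shape=[None, 10])
    w = tf.Variable(tf.zeros([10, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# The client side of this process talks to its own in-process server.
with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
```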

The simplest example of within-graph replication with separate client/worker processes is here.

  4. You generally need to configure all the nodes ahead of time through the cluster spec, and their names get assigned sequentially as /job:worker/task:0, /job:worker/task:1, etc.
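A minimal sketch of in-graph replication with a single client (TensorFlow 1.x API; the hostnames and the toy computation are made-up placeholders), showing how ops get pinned to the sequentially assigned task names:

```python
import tensorflow as tf

# One client builds a single graph that spans all workers, pinning ops
# with the names the cluster spec assigned (/job:worker/task:0, /task:1, ...).
outputs = []
for task in range(2):
    with tf.device("/job:worker/task:%d" % task):
        outputs.append(tf.reduce_sum(tf.random_normal([1000])))

total = tf.add_n(outputs)  # combine the per-worker results

# The client can connect to any one server in the cluster; the runtime
# ships each pinned op to the node it was placed on.
with tf.Session("grpc://node-b.example.com:2222") as sess:
    print(sess.run(total))
```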

Upvotes: 1
