Ujjwal

Reputation: 1859

Understanding the conceptual basics of Distributed TensorFlow

I have gone through the Distributed TensorFlow documentation, but there are some functional basics I could not understand properly, hence this question.

Let me describe the cluster setup first. Consider the following situation: I want to use Distributed TensorFlow to train a model.

  1. How do I specify network addresses to tf.train.ClusterSpec? What are those network addresses? In the documentation, are names such as localhost:2222 reserved for a particular node by the cluster manager? (See the sketch after this list.)
  2. My data is copied to node A. During training, will TensorFlow itself be responsible for sending this data as input to the GPU that is on node B?
  3. Will I need to manually create the TensorFlow graph for each GPU on each node using tf.device()?
  4. If I also want to use some additional CPU nodes, will I have to know their names beforehand and put them in the code?
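For context, here is a minimal sketch (TensorFlow 1.x API) of the kind of cluster spec I am asking about. The hostnames, ports, and job layout are placeholders I made up, not names reserved by TensorFlow:

```python
import tensorflow as tf

# Hypothetical cluster: the addresses are just host:port pairs that the
# machines can reach each other on; nothing here is predefined.
cluster = tf.train.ClusterSpec({
    "ps":     ["node-a.example.com:2222"],     # parameter server
    "worker": ["node-b.example.com:2222",      # GPU worker
               "node-c.example.com:2222"],     # another worker
})

# Each machine in the cluster runs one tf.train.Server and identifies
# itself by job name and task index (normally passed in via flags).
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # block and serve requests from clients
```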

Upvotes: 0

Views: 367

Answers (1)

Yaroslav Bulatov

Reputation: 57893

  1. Yes.
  2. Your client creates the graph and executes this graph on the worker. If you use between-graph replication, as in the how-to with a parameter server, your client and worker are the same process. This process only needs to create the part of the graph for the current node, using with tf.device. If you use within-graph replication with a single client, your client needs to create the graph for all nodes, using multiple with tf.device sections.
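A minimal sketch of the between-graph pattern described in point 2 (TensorFlow 1.x API). The hostnames, the toy model, and the hard-coded task_index are placeholders; in a real script, job_name and task_index would come from command-line flags:

```python
import tensorflow as tf

# Every worker process runs this same script with its own task_index.
cluster = tf.train.ClusterSpec({
    "ps": ["node-a.example.com:2222"],
    "worker": ["node-b.example.com:2222", "node-c.example.com:2222"],
})
task_index = 0  # this worker's index (placeholder)
server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

# replica_device_setter places variables on the ps job and everything else
# on this worker's device, so each process only builds its part of the graph.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
    x = tf.placeholder(tf.float32, shape=[None, 10])
    w = tf.Variable(tf.zeros([10, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# The client side of this process talks to its own in-process server.
with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
```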

The simplest example of within-graph replication with separate client/worker processes is here.

  4. You generally need to configure all the nodes ahead of time through the cluster spec, and their names get assigned sequentially as /job:worker/task:0, /job:worker/task:1, etc.
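A minimal sketch of in-graph replication with a single client (TensorFlow 1.x API; the hostnames and the toy computation are made-up placeholders), showing how ops get pinned to the sequentially assigned task names:

```python
import tensorflow as tf

# One client builds a single graph that spans all workers, pinning ops
# with the names the cluster spec assigned (/job:worker/task:0, /task:1, ...).
outputs = []
for task in range(2):
    with tf.device("/job:worker/task:%d" % task):
        outputs.append(tf.reduce_sum(tf.random_normal([1000])))

total = tf.add_n(outputs)  # combine the per-worker results

# The client can connect to any one server in the cluster; the runtime
# ships each pinned op to the node it was placed on.
with tf.Session("grpc://node-b.example.com:2222") as sess:
    print(sess.run(total))
```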

Upvotes: 1
