SystemFun

Reputation: 1072

Tensorflow Setup for Distributed Computing

Can anyone provide guidance on how to set up TensorFlow to work across many CPUs on a network? All of the examples I have found so far use only a single local machine, with multiple GPUs at most. I have found that I can pass a list of targets in the session_opts, but I'm not sure how to set up TensorFlow on each box to listen for networked nodes/tasks. Any example would be greatly appreciated!

Upvotes: 3

Views: 1077

Answers (1)

mrry

Reputation: 126154

The open-source version (currently 0.6.0) of TensorFlow supports single-process execution only: in particular, the only valid target in the tensorflow::SessionOptions is the empty string, which means "current process."

The TensorFlow whitepaper describes the structure of the distributed implementation (see Figure 3) that we use inside Google. The basic idea is that the Session interface can be implemented using RPC to a master; and the master can partition the computation across a set of devices in multiple worker processes, which also communicate using RPC. Alas, the current version depends heavily on Google-internal technologies (like Borg), so a lot of work remains to make it ready for external consumption. We are currently working on this, and you can follow the progress on this GitHub issue.
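The master/worker structure described above can be illustrated with a small, plain-Python sketch. This is NOT TensorFlow code: the names (Master, Worker, Session) are illustrative, thread "workers" stand in for separate worker processes, and the direct method calls stand in for the RPCs a real implementation would use:

```python
from concurrent.futures import ThreadPoolExecutor

def worker_run(task):
    """A 'worker' evaluates its assigned partition of the computation."""
    op, operands = task
    if op == "sum":
        return sum(operands)
    raise ValueError("unknown op: %s" % op)

class Master:
    """Partitions a computation across workers and merges partial results."""
    def __init__(self, num_workers=2):
        self.num_workers = num_workers
        self.pool = ThreadPoolExecutor(max_workers=num_workers)

    def run(self, op, data):
        # Split the input into one partition per worker...
        chunks = [data[i::self.num_workers] for i in range(self.num_workers)]
        partials = self.pool.map(worker_run, [(op, c) for c in chunks])
        # ...and combine the partial results on the master.
        return worker_run((op, list(partials)))

class Session:
    """Client-side stub: forwards run() calls to the master. In the real
    system this hop (and the master-to-worker dispatch) would be an RPC."""
    def __init__(self, master):
        self.master = master

    def run(self, op, data):
        return self.master.run(op, data)

sess = Session(Master(num_workers=2))
print(sess.run("sum", list(range(10))))  # prints 45
```

The key point the sketch captures is that the client only ever talks to the Session interface, so the same client code works whether the Session is backed by the local process or by a remote master.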

EDIT on 2/26/2016: Today we released an initial version of the distributed runtime to GitHub. It supports multiple machines and multiple GPUs.

Upvotes: 4
