Distributed tensorflow source code

Question

I want to read the source code for TensorFlow's distributed training feature and understand its overall structure (worker-PS relations, etc.). However, I am lost in TensorFlow's repository. Can someone guide me through the repository and point me to the source code I am looking for?

Maxim · Accepted Answer

Unfortunately, not all TensorFlow code (especially the part related to distributed computation) is open source. To quote Aurélien Géron from Hands-On Machine Learning with Scikit-Learn and TensorFlow:

The TensorFlow whitepaper presents a friendly dynamic placer algorithm that auto-magically distributes operations across all available devices, taking into account things like the measured computation time in previous runs of the graph, estimations of the size of the input and output tensors to each operation, the amount of RAM available in each device, communication delay when transferring data in and out of devices, hints and constraints from the user, and more. Unfortunately, this sophisticated algorithm is internal to Google; it was not released in the open source version of TensorFlow.

That said, here are the main entry points for distributed TensorFlow in the public repo:

Cluster in tensorflow/python/grappler/cluster.py
Server and ClusterSpec in tensorflow/python/training/server_lib.py
worker_service.proto in tensorflow/core/protobuf/worker_service.proto

To dig deeper, you'll need to read the native C++ code in the tensorflow/core/distributed_runtime package, which contains, for example, the gRPC server implementation.
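
If it helps to have a mental model of the worker-PS layout before reading server_lib.py: a cluster description is basically a mapping from job names (such as "ps" and "worker") to a list of task addresses, and a task is identified by its job name plus its index in that list. Below is a rough pure-Python sketch of that idea. It is not TensorFlow code; ToyClusterSpec, its methods, and the localhost addresses are all made up for illustration, and the real ClusterSpec does considerably more.

```python
# Toy model only: this is NOT TensorFlow's implementation.
# ToyClusterSpec is a hypothetical stand-in that mimics the basic idea of a
# cluster description: job name -> ordered list of task addresses.
from typing import Dict, List


class ToyClusterSpec:
    """Hypothetical, simplified cluster description for illustration."""

    def __init__(self, jobs: Dict[str, List[str]]) -> None:
        # Copy the lists so later changes to the caller's dict do not leak in.
        self._jobs = {name: list(addrs) for name, addrs in jobs.items()}

    def job_names(self) -> List[str]:
        """Return the job names in sorted order."""
        return sorted(self._jobs)

    def num_tasks(self, job_name: str) -> int:
        """Return how many tasks belong to the given job."""
        return len(self._jobs[job_name])

    def task_address(self, job_name: str, task_index: int) -> str:
        """Return the address of one task, identified by job name and index."""
        return self._jobs[job_name][task_index]


# Made-up addresses: one parameter server task and two worker tasks.
cluster = ToyClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each process in the cluster would be told which (job, index) it is and
# could look up every other task's address from the same description.
for job in cluster.job_names():
    for index in range(cluster.num_tasks(job)):
        print(job, index, cluster.task_address(job, index))
```

The actual tf.train.ClusterSpec is built from a dict of the same shape, so this should make the Python side of server_lib.py easier to follow; the interesting work (sessions, graph partitioning, RPCs between tasks) happens in the C++ runtime mentioned above.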
