Reputation: 1692
I want to read the source code for TensorFlow's distributed training feature and understand its overall structure (worker-PS relations, etc.). However, I am lost in TensorFlow's repository. Can someone guide me through the repository and point me to the source code I am looking for?
Upvotes: 0
Views: 361
Reputation: 53758
Unfortunately, not all TensorFlow code (especially the parts related to distributed computation) is open source. To quote Aurélien Géron from Hands-On Machine Learning with Scikit-Learn and TensorFlow:
The TensorFlow whitepaper presents a friendly dynamic placer algorithm that auto-magically distributes operations across all available devices, taking into account things like the measured computation time in previous runs of the graph, estimations of the size of the input and output tensors to each operation, the amount of RAM available in each device, communication delay when transferring data in and out of devices, hints and constraints from the user, and more. Unfortunately, this sophisticated algorithm is internal to Google; it was not released in the open source version of TensorFlow.
But here are the main entry points of TF distributed in the public repo:
Cluster in tensorflow/python/grappler/cluster.py
Server and ClusterSpec in tensorflow/python/training/server_lib.py
worker_service.proto in tensorflow/core/protobuf/worker_service.proto
To dive deeper, you'll need to read the native C++ code in the tensorflow/core/distributed_runtime package, which includes, for example, the gRPC server implementation.
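To get a feel for what ClusterSpec from server_lib.py models before reading its source, here is a minimal sketch of a worker/PS layout. It assumes TensorFlow is installed and that the class is reachable as tf.train.ClusterSpec; the job names and localhost addresses are made up for illustration, and no server is actually started.

```python
import tensorflow as tf

# Hypothetical cluster layout: one parameter server and two workers.
# The addresses are placeholders, not taken from any real deployment.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Walk the job -> task -> address mapping that ClusterSpec stores.
for job in cluster.jobs:
    for index in cluster.task_indices(job):
        print(job, index, cluster.task_address(job, index))

# Each process in the cluster would then start a Server for its own task,
# e.g. tf.distribute.Server(cluster, job_name="worker", task_index=0)
# in TF 2.x (tf.train.Server in TF 1.x). Left commented out here because
# it binds a network port.
```

Tracing how Server turns this spec into a running gRPC service is a reasonable way into the C++ code under tensorflow/core/distributed_runtime.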
Upvotes: 1