user179156

Reputation: 871

Recommended way of profiling distributed TensorFlow

Currently, I am using the TensorFlow Estimator API to train my TF model. I am using distributed training with roughly 20-50 workers and 5-30 parameter servers, depending on the training data size. Since I do not have access to the session, I cannot use run metadata with full trace to look at the Chrome trace. I see there are two other approaches:

1) tf.profiler.profile
2) tf.train.ProfilerHook

I am specifically using tf.estimator.train_and_evaluate(estimator, train_spec, test_spec)

where my estimator is a prebuilt estimator.
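
For reference, my setup looks roughly like the following sketch (the estimator, feature column, input function, and paths are just illustrative placeholders, not my real pipeline):

import tensorflow as tf

# Placeholder feature column and input function; the real ones read the distributed training data.
feature_columns = [tf.feature_column.numeric_column('x')]

def train_input_fn():
    ds = tf.data.Dataset.from_tensor_slices(({'x': [[1.0], [2.0]]}, [0, 1]))
    return ds.repeat().batch(2)

# Prebuilt (canned) estimator; model_dir is illustrative.
estimator = tf.estimator.DNNClassifier(
    hidden_units=[256, 128],
    feature_columns=feature_columns,
    model_dir='/tmp/model')

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100000)
test_spec = tf.estimator.EvalSpec(input_fn=train_input_fn)

tf.estimator.train_and_evaluate(estimator, train_spec, test_spec)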

Can someone give me some guidance on the recommended way to profile an Estimator (concrete code samples and code pointers would be really helpful, since I am very new to TensorFlow)? Do the two approaches collect different information, or do they serve the same purpose? Also, is one recommended over the other?

Upvotes: 2

Views: 783

Answers (2)

user11530462

Reputation:

TensorFlow has recently added a way to sample multiple workers.

Please have a look at the API: https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/client/trace?version=nightly

The parameter of the above API that is important in this context is:

service_addr: A comma-delimited string of gRPC addresses of the workers to profile, e.g. service_addr='grpc://localhost:6009', service_addr='grpc://10.0.0.2:8466,grpc://10.0.0.3:8466', or service_addr='grpc://localhost:12345,grpc://localhost:23456'.
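
For example, a rough sketch of remotely sampling two workers at once (the addresses, port, and log directory are just placeholders; this assumes a recent TF 2.x / nightly build, and that each worker has started a profiler server):

import tensorflow as tf

# On every worker you want to profile, expose a profiler gRPC endpoint once at startup.
tf.profiler.experimental.server.start(6009)

# From any machine that can reach those workers, sample them all for 2 seconds.
tf.profiler.experimental.client.trace(
    service_addr='grpc://10.0.0.2:6009,grpc://10.0.0.3:6009',
    logdir='/tmp/profile_logs',   # open this directory in TensorBoard's Profile tab
    duration_ms=2000)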

Also, please look at the API: https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/ProfilerOptions?version=nightly

The parameter of the above API that is important in this context is:

delay_ms: Requests for all hosts to start profiling at a timestamp that is delay_ms away from the current time. delay_ms is in milliseconds. If zero, each host will start profiling immediately upon receiving the request. Default value is None, allowing the profiler to guess the best value.
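
A sketch of how delay_ms could be used so that all hosts start profiling at roughly the same timestamp (again, addresses and paths are placeholders, and passing options to trace assumes a nightly/2.4+ build):

import tensorflow as tf

options = tf.profiler.experimental.ProfilerOptions(
    delay_ms=10000)  # ask every host to start profiling ~10 s from now

tf.profiler.experimental.client.trace(
    service_addr='grpc://10.0.0.2:6009,grpc://10.0.0.3:6009',
    logdir='/tmp/profile_logs',
    duration_ms=2000,
    options=options)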

Upvotes: 0

qqfish

Reputation: 201

There are two things you can try:

ProfileContext

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/profiler/profile_context.py

Example usage:

import tensorflow as tf  # TF 1.x, where tf.contrib is available

with tf.contrib.tfprof.ProfileContext('/tmp/train_dir') as pctx:
  train_loop()  # your existing training call

ProfilerService

https://www.tensorflow.org/tensorboard/r2/tensorboard_profiling_keras

You can start a ProfilerServer via tf.python.eager.profiler.start_profiler_server(port) on all workers and parameter servers, and then use TensorBoard to capture the profile.

Note that this is a very new feature, so you may want to use tf-nightly.
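
A minimal sketch of the worker-side part (assuming a tf-nightly / 1.14+ era build where this private module exists; the port is just an example):

from tensorflow.python.eager import profiler

# Run this once on every worker and parameter server so the profiler can be
# reached remotely; TensorBoard's profile plugin then captures a trace from
# these addresses.
profiler.start_profiler_server(6009)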

Upvotes: 1
