Reputation: 871
Currently, I am using the TensorFlow Estimator API to train my TF model. I am using distributed training with roughly 20-50 workers and 5-30 parameter servers, depending on the training data size. Since I do not have access to the session, I cannot use run metadata with a full trace to look at the Chrome trace. I see there are two other approaches:
1) tf.profiler.profile
2) tf.train.ProfilerHook
I am specifically using
tf.estimator.train_and_evaluate(estimator, train_spec, test_spec)
where my estimator is a prebuilt estimator.
Can someone give me some guidance on the recommended way to profile an estimator? Concrete code samples and code pointers would be really helpful, since I am very new to TensorFlow. Do the two approaches provide different information, or do they serve the same purpose? Is one recommended over the other?
Upvotes: 2
Views: 783
Reputation:
TensorFlow has recently added a way to sample multiple workers.
Please have a look at the API: https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/client/trace?version=nightly
The parameter of the above API that is important in this context is:
service_addr: A comma-delimited string of gRPC addresses of the workers to profile, e.g.
service_addr='grpc://localhost:6009'
service_addr='grpc://10.0.0.2:8466,grpc://10.0.0.3:8466'
service_addr='grpc://localhost:12345,grpc://localhost:23456'
Also, please look at the API, https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/ProfilerOptions?version=nightly
The parameter of the above API that is important in this context is:
delay_ms: Requests for all hosts to start profiling at a timestamp that is delay_ms away from the current time. delay_ms is in milliseconds. If zero, each host will start profiling immediately upon receiving the request. Default value is None, allowing the profiler to guess the best value.
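As a rough sketch of how the two parameters above fit together, the snippet below builds the comma-delimited service_addr string and passes a ProfilerOptions with delay_ms to tf.profiler.experimental.client.trace. The helper names (build_service_addr, trace_workers), the host list, the port, and the log directory are placeholders I made up for illustration. It also assumes a profiler server is already listening on each worker at that port and that you are on a TensorFlow version that ships this API.

```python
def build_service_addr(hosts, port):
    """Join worker hosts into the comma-delimited gRPC string service_addr expects."""
    return ",".join(f"grpc://{host}:{port}" for host in hosts)


def trace_workers(hosts, port, logdir, duration_ms=2000, delay_ms=1000):
    """Request a synchronized trace from several workers (hypothetical helper).

    Assumes each host already runs a profiler server on `port`.
    """
    # Imported lazily so the address helper can be used without TensorFlow.
    import tensorflow as tf

    # delay_ms asks every host to start at the same timestamp in the future,
    # which lines up the traces collected from different workers.
    options = tf.profiler.experimental.ProfilerOptions(delay_ms=delay_ms)
    tf.profiler.experimental.client.trace(
        service_addr=build_service_addr(hosts, port),
        logdir=logdir,
        duration_ms=duration_ms,
        options=options,
    )


# Example call with placeholder addresses:
# trace_workers(["10.0.0.2", "10.0.0.3"], 8466, "/tmp/profile_logs")
```

The resulting log directory can then be opened in TensorBoard's profile tab, assuming the profiler plugin is installed.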
Upvotes: 0
Reputation: 201
There are two things you can try:
ProfileContext
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/profiler/profile_context.py Example usage:
    import tensorflow as tf

    # tf.contrib is only available in TensorFlow 1.x.
    with tf.contrib.tfprof.ProfileContext('/tmp/train_dir') as pctx:
        train_loop()  # your own training code
ProfilerService
https://www.tensorflow.org/tensorboard/r2/tensorboard_profiling_keras
You can start a ProfilerServer via tf.python.eager.profiler.start_profiler_server(port) on all workers and parameter servers, and then use TensorBoard to capture a profile.
Note that this is a very new feature, so you may want to use tf-nightly.
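For reference, here is a minimal sketch of starting the profiler server on each task before training begins. It uses tf.profiler.experimental.server.start, which is the public entry point in later TensorFlow 2.x releases rather than the internal path mentioned above, so treat it as an assumption about your installed version. The PROFILER_PORT environment variable and the helper names are hypothetical, chosen only for this example.

```python
import os


def profiler_port(default=6009):
    """Read the port from a hypothetical PROFILER_PORT environment variable."""
    return int(os.environ.get("PROFILER_PORT", default))


def start_profiler_server(port=None):
    """Start a profiler server in this process (call once per worker / ps task)."""
    # Imported lazily so the port helper can be used without TensorFlow.
    import tensorflow as tf

    port = profiler_port() if port is None else port
    # Assumes a TensorFlow 2.x build that provides tf.profiler.experimental.server.
    tf.profiler.experimental.server.start(port)
    return port


# In each task's entry point, before tf.estimator.train_and_evaluate(...):
# start_profiler_server()
```

Once every task is listening, TensorBoard's capture dialog (or a programmatic trace request) can be pointed at the grpc://host:port addresses of the tasks you want to sample.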
Upvotes: 1