sejanDee

Reputation: 11

Tensorflow profiling timelines inconsistent with wall time

I am currently developing and profiling a neural network in TensorFlow on a GTX 1070. My first implementation turned out to be rather inefficient due to some expensive FFT ops. The upper timeline in the attached picture shows that one training step takes about 115 ms. A second (mathematically equivalent) implementation seems to consume much less time; one step takes under 10 ms.

timelines (Sorry, I am not allowed to post pics yet)

Surprisingly, both methods take approximately the same time when measured in Python via:

# build model, get train_op ...
sess = tf.Session()

import timeit

# Time 100 training steps end to end (wall time, including Python overhead).
t = timeit.default_timer()
for i in range(100):
    sess.run(train_op)
dt = timeit.default_timer() - t
print(dt)

In both cases 100 runs take ~17 s, which is well above both 100 × 115 ms and 100 × 10 ms, and more than I would expect from Python overhead alone. The network is fed using queues, but the results are the same when I feed numpy arrays directly into the graph, so I concluded that empty queues cannot be the bottleneck.
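(For reference, by "feeding numpy arrays directly" I mean replacing the queue input with placeholders and a feed_dict, roughly like the sketch below; the toy model, placeholder names and batch shapes are just stand-ins for my actual network.)

import timeit
import numpy as np
import tensorflow as tf

# Stand-in model: x_ph, y_ph and the single dense layer are illustrative only;
# the point is the feed_dict-based timing pattern, not the architecture.
x_ph = tf.placeholder(tf.float32, shape=[None, 128])
y_ph = tf.placeholder(tf.float32, shape=[None, 10])
logits = tf.layers.dense(x_ph, 10)
loss = tf.losses.softmax_cross_entropy(onehot_labels=y_ph, logits=logits)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

x_batch = np.random.rand(32, 128).astype(np.float32)
y_batch = np.eye(10)[np.random.randint(0, 10, size=32)].astype(np.float32)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Same wall-time measurement as above, but without any queues involved.
t = timeit.default_timer()
for i in range(100):
    sess.run(train_op, feed_dict={x_ph: x_batch, y_ph: y_batch})
print(timeit.default_timer() - t)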

The timelines are recorded in essentially the same way as above, with the difference that I measure two separate runs:

# build model, get train_op ...
sess = tf.Session()

def write_timeline(path):
    from tensorflow.python.client import timeline

    # Trace a single training step with full profiling enabled.
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(train_op, options=run_options, run_metadata=run_metadata)

    # Create the Timeline object and write it to a JSON file.
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open(path, 'w') as f:
        f.write(ctf)

write_timeline('timeline_1st_run.json')
write_timeline('timeline_2nd_run.json')

The first run always takes much longer than the corresponding following runs. Is TensorFlow doing some optimization here, or allocating memory for the first time?

timelines_first_runs

Can someone explain these inconsistent results? Why does my second implementation take the same time when measured in Python, even though the profiling timeline shows that it should run much faster?

Upvotes: 1

Views: 1547

Answers (1)

Yaroslav Bulatov

Reputation: 57923

The first run can take longer because of global memory pool growth, PTX compilation, and graph optimizations.

You could turn off some of the optimizations as follows:

sess = tf.Session(config=tf.ConfigProto(
    graph_options=tf.GraphOptions(
        optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0))))

However, PTX compilation can still add 1-2 seconds to the first run, so for profiling purposes people usually include a "pre-warming" run that executes the same computation before the profiled runs.
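A pre-warming run can be as simple as the following sketch (reusing the write_timeline helper and train_op from the question):

# Warm-up step: triggers PTX compilation, memory-pool growth and graph
# optimizations so they do not show up in the profiled runs.
sess.run(train_op)

# Subsequent traced runs now measure steady-state performance.
write_timeline('timeline_1st_run.json')
write_timeline('timeline_2nd_run.json')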

Also, if your train_op is something like assign or assign_add, running train_op will fetch the result back into Python. Python <-> TensorFlow memory transfers are not reflected in the timeline (grpc transfers are also missing; there is a tracking issue for this). For profiling purposes you could run the op and omit the Python transfer as follows:

sess.run(train_op.op)
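The same applies to the wall-time measurement; the timing loop from the question would then become something like this sketch (assuming train_op is a tensor produced by assign or assign_add):

import timeit

t = timeit.default_timer()
for i in range(100):
    # .op runs the underlying operation without fetching its value back to Python.
    sess.run(train_op.op)
print(timeit.default_timer() - t)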

Upvotes: 1
