Reputation: 11
I am currently developing and profiling a neural network in TensorFlow on a GTX 1070. My first implementation turned out to be rather inefficient due to some expensive FFT ops. The upper timeline in the attached picture shows that one training step takes about 115 ms. A second (mathematically equivalent) implementation seems to consume much less time: one step takes under 10 ms.
timelines (Sorry, I am not allowed to post pics yet)
Surprisingly, both methods take approximately the same time when measured in Python via:
# build model, get train_op ...
sess = tf.Session()
import timeit
t = timeit.default_timer()
for i in range(100):
    sess.run(train_op)
dt = timeit.default_timer() - t
print(dt)
In both cases 100 runs take ~17 s, which is more than 100 * 115 ms and far more than 100 * 10 ms; that gap is bigger than I would expect from Python overhead alone. The network is fed using queues, but the results are the same when feeding numpy arrays directly into the graph, so I concluded that empty queues cannot be the bottleneck.
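For reference, the direct-feeding variant I timed looks roughly like this (input_ph, x_batch and the shapes are only illustrative, not the actual names in my graph):
import numpy as np
x_batch = np.random.rand(32, 128).astype(np.float32)  # dummy batch, real data differs
t = timeit.default_timer()
for i in range(100):
    sess.run(train_op, feed_dict={input_ph: x_batch})  # bypasses the queues entirely
dt = timeit.default_timer() - t
print(dt)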
The timelines are recorded in essentially the same way; the only difference is that I measure two consecutive runs:
# build model, get train_op ...
sess = tf.Session()

def write_timeline(path):
    from tensorflow.python.client import timeline
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(train_op, options=run_options, run_metadata=run_metadata)
    # Create the Timeline object, and write it to a json file
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open(path, 'w') as f:
        f.write(ctf)

write_timeline('timeline_1st_run.json')
write_timeline('timeline_2nd_run.json')
The first run always takes much longer than the following runs. Is TensorFlow doing some optimization here, or allocating memory for the first time?
Can someone explain these inconsistent results? Why does my second implementation take the same time measured in Python when the profiling timeline shows that it should run much faster?
Upvotes: 1
Views: 1547
Reputation: 57923
The first run can take longer because of global memory pool growth, PTX compilation, and graph optimizations.
You could turn off some of the optimizations as follows:
sess = tf.Session(config=tf.ConfigProto(graph_options=tf.GraphOptions(optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0))))
However, PTX compilation can still add 1-2 seconds to the first run, so for profiling purposes people usually include a "pre-warming" run that executes the same computation before the run being profiled.
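For example, a warm-up step before writing the first timeline could look like this (a sketch reusing the write_timeline function from the question; the file name is only illustrative):
# run the graph once untraced so PTX compilation and memory-pool
# growth happen before the profiled runs
sess.run(train_op)
write_timeline('timeline_after_warmup.json')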
Also, if your train_op is something like assign or assign_add, running train_op will fetch the result back into Python. Python<->TensorFlow memory transfers are not reflected in the timeline (grpc transfers are also missing; there is a tracking issue for this). For profiling purposes you could run the op and omit the Python transfer as follows:
sess.run(train_op.op)
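For example (v and increment are illustrative names, not taken from your graph):
v = tf.Variable(0.0)
increment = tf.assign_add(v, 1.0)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
print(sess.run(increment))   # 1.0, the updated value is copied back into Python
sess.run(increment.op)       # runs the same op but returns None, no fetch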
Upvotes: 1