Pelups

Reputation: 78

Tensorflow - Inference time evaluation

I'm evaluating different image classification models with TensorFlow, specifically their inference time on different devices. I was wondering whether I have to use pretrained models or not. I'm using a script that generates 1000 random input images, feeds them one by one to the network, and calculates the mean inference time.

Thank you !

Upvotes: 3

Views: 10816

Answers (2)

Patwie

Reputation: 4450

Let me start with a warning:

Most people benchmark neural networks the wrong way. For GPUs there is disk I/O, memory bandwidth, PCI bandwidth, and the speed of the GPU itself. On top of that come implementation faults, like using feed_dict in TensorFlow. The same is true for training these models efficiently.

Let's start with a simple example, considering a GPU:

import tensorflow as tf
import numpy as np

# a constant input tensor of shape (1, 9)
data = np.arange(9 * 1).reshape(1, 9).astype(np.float32)
data = tf.constant(data, name='data')

# a single fully connected layer standing in for the network
activation = tf.layers.dense(data, 10, name='fc')

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(activation))

All it does is create a constant tensor and apply a fully connected layer. All the operations are placed on the GPU:

fc/bias: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587959: I tensorflow/core/common_runtime/placer.cc:874] fc/bias: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
fc/bias/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587970: I tensorflow/core/common_runtime/placer.cc:874] fc/bias/read: (Identity)/job:localhost/replica:0/task:0/device:GPU:0
fc/bias/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587979: I tensorflow/core/common_runtime/placer.cc:874] fc/bias/Assign: (Assign)/job:localhost/replica:0/task:0/device:GPU:0
fc/kernel: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
2018-01-25 09:55:01.587988: I tensorflow/core/common_runtime/placer.cc:874] fc/kernel: (VariableV2)/job:localhost/replica:0/task:0/device:GPU:0
fc/kernel/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
...

Looks good, right? Benchmarking this graph might give a rough estimate of how fast the TensorFlow graph can be executed. Just replace tf.layers.dense with your network. If you accept the overhead of using Python's time package, you are done, as sketched below.
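
For reference, a minimal timing sketch along these lines (assuming the activation tensor and sess from the example above; the warm-up loop and run count are additions of mine to exclude one-time setup costs such as memory allocation):

import time

# warm-up runs, so one-time setup costs are not measured
for _ in range(10):
    sess.run(activation)

n_runs = 1000
start = time.time()
for _ in range(n_runs):
    sess.run(activation)
print('mean time per run: %.3f ms' % ((time.time() - start) / n_runs * 1000))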

But this is, unfortunately, not the entire story. Fetching the result of the tensor-op 'fc/BiasAdd:0' means accessing device memory (GPU) and copying it back to host memory (CPU, RAM). Hence you hit the PCI bandwidth limitation at some point. And there is a Python interpreter sitting somewhere as well, taking CPU cycles.

Further, the operations are placed on the GPU, but not necessarily the values themselves. I'm not sure which TF version you are using, but in older versions even a tf.constant gave no guarantee of being placed on the GPU, which I only noticed when writing my own ops. By the way, see my other answer on how TF decides where to place operations.

Now, the hard part: it depends on your graph. Having a tf.cond or tf.where sitting somewhere makes benchmarking harder. You then run into the same struggles you have to address when training a deep network efficiently. Meaning, a simple constant input cannot cover all cases.

A solution starts by putting/staging some values directly into GPU memory (dummy below stands in for your input tensor) by running

from tensorflow.python.ops import data_flow_ops

dummy = tf.random_uniform([1, 9])  # illustrative stand-in for your input data

stager = data_flow_ops.StagingArea([tf.float32])
enqueue_op = stager.put([dummy])
dequeue_op = tf.reduce_sum(stager.get())

# fill the staging area on the device before benchmarking
for i in range(1000):
    sess.run(enqueue_op)

beforehand. But again, the TF resource manager decides where it puts values (and there is no guarantee about ordering or about dropping/keeping values).
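
The benchmark can then run against the staged values, roughly like this (a sketch assuming the dequeue_op and sess from above; warm-up and run counts are arbitrary):

import time

# warm-up, then time only the compute on values that are already on the device
for _ in range(10):
    sess.run(dequeue_op)

n_runs = 900  # fewer than the 1000 staged values
start = time.time()
for _ in range(n_runs):
    sess.run(dequeue_op)
print('mean time per run: %.3f ms' % ((time.time() - start) / n_runs * 1000))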

To sum it up: benchmarking is a highly complex task, because benchmarking CUDA code alone is already complex, and now you have CUDA plus Python parts on top. And it is a highly subjective task, depending on which parts you are interested in (just the graph, including disk I/O, ...).

I usually run the graph with a tf.constant input as in the example and use the profiler to see what's going on in the graph.
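
A rough sketch of hooking in the profiler (assuming TF 1.x and the activation op from the example above); the resulting trace can be opened in chrome://tracing:

from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# run the graph once while collecting per-op timing information
sess.run(activation, options=run_options, run_metadata=run_metadata)

# write a Chrome trace file for inspection in chrome://tracing
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())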

For some general ideas on how to improve runtime performance, you might want to read the TensorFlow Performance Guide.

Upvotes: 5

Max F.

Reputation: 328

So, to clarify, you are only interested in the runtime per inference step and not in accuracy or any ML-related performance metrics?

In that case it should not matter much whether you initialize your model from a pretrained checkpoint or just from scratch via the initializers (e.g. truncated_normal or constant) assigned to each variable in your graph.

The underlying mathematical operations will be the same, mainly matrix multiplications, for which it doesn't matter (much) which values the underlying add and multiply operations are performed on.
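
A small sketch of the two initialization paths (the toy graph and the checkpoint path below are placeholders, not from the question):

import tensorflow as tf

# a toy graph standing in for the real model
x = tf.random_uniform([1, 224, 224, 3])
logits = tf.layers.dense(tf.layers.flatten(x), 10)
saver = tf.train.Saver()

with tf.Session() as sess:
    # option 1: random weights from each variable's initializer
    sess.run(tf.global_variables_initializer())

    # option 2: pretrained weights instead ('/path/to/model.ckpt' is a placeholder)
    # saver.restore(sess, '/path/to/model.ckpt')

    # the per-step cost of this run is the same in both cases
    sess.run(logits)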

This could be a bit different if your graph contains more advanced control-flow structures like tf.while_loop, where the amount of computation actually executed can depend on the values of certain tensors.

Of course, the time it takes to initialize your graph at the very beginning of program execution will differ depending on whether you initialize from scratch or from a checkpoint.

Hope this helps.

Upvotes: 2
