Reputation: 775
I am trying to run a forward pass on a convolutional neural network consisting of a convolutional layer, followed by a pooling layer, and finally a rectified linear unit (ReLU) activation layer. The details of the input data and the convolutional layer filters are as follows:
X: 4-dimensional input data with shape [N, H, W, C], where N = 60000 is the batch size, H = 32 is the height of an input image, W = 32 is the width of an input image, and C = 1 is the number of channels in an input image.
W: 4-dimensional convolutional filter with shape [F, F, C, Cout], where F = 3 is the height and width of the filter, C = 1 is the number of channels in the input image, and Cout = 6 is the number of channels in the output image.
There are three approaches to do this.
Approach 1: Without using tf.constant() or tf.placeholder()
import numpy as np
import tensorflow as tf
X = np.random.random([60000, 32, 32, 1])
W = np.random.random([3, 3, 1, 6])
C = tf.nn.conv2d(X, W, strides=[1,1,1,1], padding="VALID")
P = tf.nn.avg_pool(C, ksize=[1,2,2,1], strides=[1,2,2,1], padding="VALID")
A = tf.nn.relu(P)
with tf.Session() as sess:
    result = sess.run(A)  # Takes 14.98 seconds
Approach 2: Using tf.constant()
import numpy as np
import tensorflow as tf
X = tf.constant(np.random.random([60000, 32, 32, 1]), dtype=tf.float64)
W = tf.constant(np.random.random([3, 3, 1, 6]), dtype=tf.float64)
C = tf.nn.conv2d(X, W, strides=[1,1,1,1], padding="VALID")
P = tf.nn.avg_pool(C, ksize=[1,2,2,1], strides=[1,2,2,1], padding="VALID")
A = tf.nn.relu(P)
with tf.Session() as sess:
    result = sess.run(A)  # Takes 14.73 seconds
Approach 3: Using tf.placeholder()
import numpy as np
import tensorflow as tf
x = np.random.random([60000, 32, 32, 1])
w = np.random.random([3, 3, 1, 6])
X = tf.placeholder(dtype=tf.float64, shape=[None, 32, 32, 1])
W = tf.placeholder(dtype=tf.float64, shape=[3, 3, 1, 6])
C = tf.nn.conv2d(X, W, strides=[1,1,1,1], padding="VALID")
P = tf.nn.avg_pool(C, ksize=[1,2,2,1], strides=[1,2,2,1], padding="VALID")
A = tf.nn.relu(P)
with tf.Session() as sess:
    result = sess.run(A, feed_dict={X: x, W: w})  # Takes 3.21 seconds
Approach 3 (using tf.placeholder()) runs almost 4-5x faster than Approach 1 and Approach 2.
All these experiments were conducted on an NVIDIA GeForce GTX 1080 GPU.
The question is: why do we get an almost 4-5x speedup simply by using tf.placeholder() in Approach 3, compared to Approach 1 and Approach 2?
What is tf.placeholder() doing in its underlying implementation that allows it to perform so well?
Upvotes: 4
Views: 230
Reputation: 10474
Shoutout to @y.selivonchyk for the invaluable experiments; however, I feel the answer doesn't elaborate on why these results occur.
I believe this is not so much about 'placeholder' being "good", but rather about the other two methods being a bad idea.
I would presume that 1) and 2) are actually the same and that 1) converts the array to a constant under the hood -- at least this would explain the identical behavior.
The reason 1) and 2) take so long is that constants are embedded explicitly into the computational graph. Since they are quite large tensors, this explains why the graph takes so long to build. However, once the graph is built, subsequent runs are faster because everything is "contained" in there. You should generally try to avoid including large pieces of data in the graph itself -- it should ideally just be a set of instructions for computation (i.e. TensorFlow ops).
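To see this in action, here is a minimal sketch (assuming TensorFlow 1.x; the exact sizes will vary) that compares how large the serialized GraphDef becomes when the data is baked in as a constant versus referenced through a placeholder:
import numpy as np
import tensorflow as tf

# Graph with the input data embedded as a constant.
tf.reset_default_graph()
X = tf.constant(np.random.random([60000, 32, 32, 1]), dtype=tf.float64)
W = tf.constant(np.random.random([3, 3, 1, 6]), dtype=tf.float64)
A = tf.nn.relu(tf.nn.conv2d(X, W, strides=[1, 1, 1, 1], padding="VALID"))
# The GraphDef now carries the full 60000x32x32x1 float64 tensor (roughly 490 MB).
print('constant graph: %d bytes' % tf.get_default_graph().as_graph_def().ByteSize())

# The same graph built with placeholders instead.
tf.reset_default_graph()
X = tf.placeholder(dtype=tf.float64, shape=[None, 32, 32, 1])
W = tf.placeholder(dtype=tf.float64, shape=[3, 3, 1, 6])
A = tf.nn.relu(tf.nn.conv2d(X, W, strides=[1, 1, 1, 1], padding="VALID"))
# Only the op definitions are stored; this is just a few kilobytes.
print('placeholder graph: %d bytes' % tf.get_default_graph().as_graph_def().ByteSize())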
With 3), the graph is much faster to build because we do not embed the huge array in it, only a symbolic placeholder. However, execution is slower than 1) and 2) because the value needs to be fed into the placeholder on each call (which also means the data has to be transferred onto the GPU each time, in case you are running on one).
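If you want to avoid both problems -- not embedding the data in the graph, but also not re-feeding (and re-transferring) it on every run -- a common TensorFlow 1.x pattern is to feed a placeholder once to initialize a non-trainable variable that then lives in device memory. A minimal sketch of that idea (not part of the original experiments):
import numpy as np
import tensorflow as tf

x = np.random.random([60000, 32, 32, 1])

# The placeholder is used exactly once, to initialize the variable.
X_init = tf.placeholder(dtype=tf.float64, shape=[60000, 32, 32, 1])
# collections=[] keeps this bulky data holder out of the global-variables collection.
X = tf.Variable(X_init, trainable=False, collections=[])

W = tf.constant(np.random.random([3, 3, 1, 6]), dtype=tf.float64)
C = tf.nn.conv2d(X, W, strides=[1, 1, 1, 1], padding="VALID")
P = tf.nn.avg_pool(C, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")
A = tf.nn.relu(P)

with tf.Session() as sess:
    # Pay the host-to-device transfer once, at initialization time.
    sess.run(X.initializer, feed_dict={X_init: x})
    # Later runs read X straight from device memory; no feed_dict needed.
    result = sess.run(A)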
Upvotes: 3
Reputation: 9910
I got 12 sec, 12 sec, and 1 sec respectively. But.
Your method does not account for set-up time: graph construction, memory allocation, graph optimization, etc. I took it upon myself to advance your experiments a little bit. Namely, I make 10 calls to session.run() for each method and measure not only the total time but also the time of each individual call. Below are the results of these experiments. The interesting part is the execution time of the first call.
%%time
import time
import numpy as np
import tensorflow as tf
X = np.random.random([60000, 32, 32, 1])
W = np.random.random([3, 3, 1, 6])
C = tf.nn.conv2d(X, W, strides=[1,1,1,1], padding="VALID")
P = tf.nn.avg_pool(C, ksize=[1,2,2,1], strides=[1,2,2,1], padding="VALID")
A = tf.nn.relu(P)
with tf.Session() as sess:
    for i in range(10):
        ts = time.time()
        result = sess.run(A)
        te = time.time()
        print('%2.2f sec' % (te-ts))
10.44 sec
0.24 sec
0.23 sec
0.23 sec
0.23 sec
0.24 sec
0.23 sec
0.23 sec
0.24 sec
0.23 sec
CPU times: user 17 s, sys: 7.56 s, total: 24.5 s
Wall time: 13.8 s
2:
%%time
import time
import numpy as np
import tensorflow as tf
X = tf.constant(np.random.random([60000, 32, 32, 1]), dtype=tf.float64)
W = tf.constant(np.random.random([3, 3, 1, 6]), dtype=tf.float64)
C = tf.nn.conv2d(X, W, strides=[1,1,1,1], padding="VALID")
P = tf.nn.avg_pool(C, ksize=[1,2,2,1], strides=[1,2,2,1], padding="VALID")
A = tf.nn.relu(P)
with tf.Session() as sess:
    for i in range(10):
        ts = time.time()
        result = sess.run(A)
        te = time.time()
        print('%2.2f sec' % (te-ts))
10.53 sec
0.23 sec
0.23 sec
0.24 sec
0.23 sec
0.23 sec
0.23 sec
0.23 sec
0.23 sec
0.26 sec
CPU times: user 17 s, sys: 7.77 s, total: 24.8 s
Wall time: 14.1 s
3:
%%time
import time
import numpy as np
import tensorflow as tf
x = np.random.random([60000, 32, 32, 1])
w = np.random.random([3, 3, 1, 6])
X = tf.placeholder(dtype=tf.float64, shape=[None, 32, 32, 1])
W = tf.placeholder(dtype=tf.float64, shape=[3, 3, 1, 6])
C = tf.nn.conv2d(X, W, strides=[1,1,1,1], padding="VALID")
P = tf.nn.avg_pool(C, ksize=[1,2,2,1], strides=[1,2,2,1], padding="VALID")
A = tf.nn.relu(P)
with tf.Session() as sess:
    for i in range(10):
        ts = time.time()
        result = sess.run(A, feed_dict={X: x, W: w})
        te = time.time()
        print('%2.2f sec' % (te-ts))
0.45 sec
0.45 sec
0.45 sec
0.45 sec
0.45 sec
0.45 sec
0.45 sec
0.45 sec
0.45 sec
0.45 sec
CPU times: user 2.81 s, sys: 2.31 s, total: 5.12 s
Wall time: 5.02 s
As you can see, for the first two methods the first call to sess.run indeed takes quite some time (about 10 sec), while method 3 always takes 0.45 sec. But the second and subsequent runs of the first two are twice as fast, at 0.23 sec, presumably because by then the constant data already resides in device memory, whereas method 3 has to feed (and transfer) it on every call.
Upvotes: 4