Reputation: 2299
I just tried using a TPU in Google Colab and I wanted to see how much faster it is than a GPU. Surprisingly, I got the opposite result.
The network is the following:
random_image = tf.random_normal((100, 100, 100, 3))
result = tf.layers.conv2d(random_image, 32, 7)
result = tf.reduce_sum(result)
Performance results:
CPU: 8s
GPU: 0.18s
TPU: 0.50s
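(The post doesn't show the CPU/GPU harness; here is a minimal sketch of one way those numbers could have been measured, assuming a plain tf.Session with the ops pinned to a single device. The device string is an assumption, not part of the original code.)
import time
import tensorflow as tf

# Assumed harness: pin the same ops to one device and time a single run.
with tf.device('/gpu:0'):  # or '/cpu:0' for the CPU measurement
    random_image = tf.random_normal((100, 100, 100, 3))
    result = tf.reduce_sum(tf.layers.conv2d(random_image, 32, 7))

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    start = time.time()
    session.run(result)
    print(time.time() - start)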
I wonder why... The complete code for the TPU is as follows:
def calc():
  random_image = tf.random_normal((100, 100, 100, 3))
  result = tf.layers.conv2d(random_image, 32, 7)
  result = tf.reduce_sum(result)
  return result

tpu_ops = tf.contrib.tpu.batch_parallel(calc, [], num_shards=8)
session = tf.Session(tpu_address)
try:
  print('Initializing global variables...')
  session.run(tf.global_variables_initializer())
  print('Warming up...')
  session.run(tf.contrib.tpu.initialize_system())
  print('Profiling')
  start = time.time()
  session.run(tpu_ops)
  end = time.time()
  elapsed = end - start
  print(elapsed)
finally:
  session.run(tf.contrib.tpu.shutdown_system())
  session.close()
Upvotes: 5
Views: 9685
Reputation: 91
Benchmarking devices properly is hard, so please take everything you learn from these examples with a grain of salt. It's better in general to compare specific models you are interested in (e.g. running an ImageNet network) to understand performance differences. That said, I understand it's fun to do this, so...
Larger models will illustrate TPU and GPU performance better. Your example is also including the compilation time in the cost of the TPU call: every call after the first for a given program and shape is cached, so you will want to run tpu_ops once before starting the timer, unless you want to capture the compilation time.
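In code, that warm-up looks like this (a minimal sketch reusing the tpu_ops and session from the question):
# Run once so XLA compiles and caches the program for this shape.
session.run(tpu_ops)

# Now time only the cached execution.
start = time.time()
session.run(tpu_ops)
print(time.time() - start)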
Currently, each call to a TPU function copies the weights to the TPU before it can start running, so small operations are affected disproportionately. Here's an example that runs a loop on the TPU before returning to the CPU. With this approach, you can actually run 100 iterations of the function in 0.55s.
import os
import time
import tensorflow as tf

def calc(n):
  img = tf.random_normal((128, 100, 100, 3))

  def body(_):
    result = tf.layers.conv2d(img, 32, 7)
    result = tf.reduce_sum(result)
    return result

  # Run n iterations of body on the TPU before returning to the CPU.
  return tf.contrib.tpu.repeat(n[0], body, [0.0])

session = tf.Session('grpc://' + os.environ['COLAB_TPU_ADDR'])
try:
  print('Initializing TPU...')
  session.run(tf.contrib.tpu.initialize_system())
  for i in [1, 10, 100, 500]:
    # Shard the computation across the 8 TPU cores.
    tpu_ops = tf.contrib.tpu.batch_parallel(calc, [[i] * 8], num_shards=8)
    print('Warming up...')
    session.run(tf.global_variables_initializer())
    # The first call compiles the program; run it once so the timed
    # run below measures execution only.
    session.run(tpu_ops)
    print('Profiling')
    start = time.time()
    session.run(tpu_ops)
    end = time.time()
    elapsed = end - start
    print(i, elapsed)
finally:
  session.run(tf.contrib.tpu.shutdown_system())
  session.close()
Upvotes: 9