fatdragon

Reputation: 2299

TPU slower than GPU?

I just tried using a TPU in Google Colab and wanted to see how much faster the TPU is than the GPU. Surprisingly, I got the opposite result.

Here is the network:

  random_image = tf.random_normal((100, 100, 100, 3))
  result = tf.layers.conv2d(random_image, 32, 7)
  result = tf.reduce_sum(result)

Performance results:

CPU: 8s
GPU: 0.18s
TPU: 0.50s

I wonder why. The complete TPU code is as follows:

import os
import time
import tensorflow as tf

tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']

def calc():
  random_image = tf.random_normal((100, 100, 100, 3))
  result = tf.layers.conv2d(random_image, 32, 7)
  result = tf.reduce_sum(result)
  return result

tpu_ops = tf.contrib.tpu.batch_parallel(calc, [], num_shards=8)

session = tf.Session(tpu_address)
try:
  print('Initializing global variables...')
  session.run(tf.global_variables_initializer())
  print('Warming up...')
  session.run(tf.contrib.tpu.initialize_system())
  print('Profiling')
  start = time.time()
  session.run(tpu_ops)
  end = time.time()
  elapsed = end - start
  print(elapsed)
finally:
  session.run(tf.contrib.tpu.shutdown_system())
  session.close()

Upvotes: 5

Views: 9685

Answers (1)

Russell Power

Reputation: 91

Benchmarking devices properly is hard, so please take everything you learn from these examples with a grain of salt. It's better in general to compare specific models you are interested in (e.g. running an ImageNet network) to understand performance differences. That said, I understand it's fun to do this, so...

Larger models will illustrate the TPU and GPU performance better. Your example also includes the compilation time in the cost of the TPU call: every call after the first for a given program and shape is cached, so you will want to run tpu_ops once before starting the timer unless you want to capture the compilation time.
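The warm-up pattern itself is generic and worth pulling out. A minimal sketch (my own helper, not part of any TF API; `run_op` is a stand-in for whatever you would time, e.g. a closure over `session.run(tpu_ops)`):

```python
import time

def benchmark(run_op, warmup_runs=1, timed_runs=1):
    """Run `run_op` untimed a few times to absorb one-off costs
    (compilation, caching), then time the remaining runs."""
    for _ in range(warmup_runs):
        run_op()  # first call pays compilation; discard it
    start = time.perf_counter()
    for _ in range(timed_runs):
        run_op()  # cached program: steady-state cost only
    return (time.perf_counter() - start) / timed_runs

# Example: time a cheap pure-Python callable
per_call = benchmark(lambda: sum(range(10000)), warmup_runs=2, timed_runs=5)
print('%.1f us per call' % (per_call * 1e6))
```

Averaging over several timed runs also smooths out scheduler noise, which matters when individual calls are only milliseconds long.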

Currently, each call to a TPU function copies the weights to the TPU before it can start running, which affects small operations more significantly. Here's an example that runs a loop on the TPU before returning to the CPU, with the following outputs:

  • 1 iteration: 0.010800600051879883 s
  • 10 iterations: 0.09931182861328125 s
  • 100 iterations: 0.5581905841827393 s
  • 500 iterations: 2.7688047885894775 s

So you can actually run 100 iterations of this function in 0.55s.
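Dividing each timing by its iteration count (numbers taken from the list above) shows how the fixed per-call overhead amortizes away:

```python
# Total elapsed seconds per iteration count, from the measurements above.
timings = {
    1: 0.010800600051879883,
    10: 0.09931182861328125,
    100: 0.5581905841827393,
    500: 2.7688047885894775,
}

for n, total in sorted(timings.items()):
    # Per-iteration cost drops as n grows: the one-off call
    # overhead is spread across more loop iterations.
    print(n, total / n)
```

The per-iteration cost falls from about 10.8 ms for a single call toward roughly 5.5 ms at 500 iterations, suggesting the overhead of a single TPU call is comparable to the conv itself at this size.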

import os
import time
import tensorflow as tf

def calc(n):
  img = tf.random_normal((128, 100, 100, 3))
  def body(_):
    result = tf.layers.conv2d(img, 32, 7)
    result = tf.reduce_sum(result)
    return result

  return tf.contrib.tpu.repeat(n[0], body, [0.0])


session = tf.Session('grpc://' + os.environ['COLAB_TPU_ADDR'])
try:
  print('Initializing TPU...')
  session.run(tf.contrib.tpu.initialize_system())

  for i in [1, 10, 100, 500]:
    tpu_ops = tf.contrib.tpu.batch_parallel(calc, [[i] * 8], num_shards=8)
    print('Warming up...')
    session.run(tf.global_variables_initializer())
    session.run(tpu_ops)
    print('Profiling')
    start = time.time()
    session.run(tpu_ops)
    end = time.time()
    elapsed = end - start
    print(i, elapsed)
finally:
  session.run(tf.contrib.tpu.shutdown_system())
  session.close()

Upvotes: 9
