Srinath Ganesh

Reputation: 2558

TensorFlow GPU: No Performance increase in HelloWorld code

Background:

I am a Python Developer new to TensorFlow.

System Spec:

I am running TensorFlow on Docker (I found installing the CUDA stack directly on the host too complicated and time-consuming, and I may have messed something up).


Basically, I am running a kind of HelloWorld program on both the GPU and the CPU to see what difference it makes, and to my surprise there is hardly any!

docker-compose.yml

version: '2.3'

services:
  tensorflow:
    # image: tensorflow/tensorflow:latest-gpu-py3
    image: tensorflow/tensorflow:latest-py3
    runtime: nvidia
    volumes:
      - ./:/notebooks/TensorTest1
    ports:
      - 8888:8888
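
Just to confirm the GPU container actually sees the card, a quick check like this can help (a minimal sketch using the TF 1.x tf.test.is_gpu_available API; it is not part of my benchmark):

import tensorflow as tf
from tensorflow.python.client import device_lib

# True if at least one GPU device is registered with TensorFlow (TF 1.x API)
print("GPU available:", tf.test.is_gpu_available())

# List every device TensorFlow can use; a working setup shows a /device:GPU:0 entry
for device in device_lib.list_local_devices():
    print(device.name)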

When I run with image: tensorflow/tensorflow:latest-py3, I get approximately 5 seconds.

root@e7dc71acfa59:/notebooks/TensorTest1# python3 hello1.py 
2018-11-18 14:37:24.288321: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
TIME: 4.900559186935425
result:  [3. 3. 3. ... 3. 3. 3.]

When I run with image: tensorflow/tensorflow:latest-gpu-py3, I again get approximately 5 seconds.

root@baf68fc71921:/notebooks/TensorTest1# python3 hello1.py 
2018-11-18 14:39:39.811575: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-18 14:39:39.877483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-18 14:39:39.878122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.189
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.56GiB
2018-11-18 14:39:39.878148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-18 14:44:17.101263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-18 14:44:17.101303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2018-11-18 14:44:17.101313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2018-11-18 14:44:17.101540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3259 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
TIME: 5.82940673828125
result:  [3. 3. 3. ... 3. 3. 3.]

My Code

import tensorflow as tf
import time

with tf.Session():
    # Note: start_time is taken before the constants are built, so this
    # measures graph construction plus execution, not execution alone
    start_time = time.time()

    # Two constant tensors of 4 million floats each, added element-wise
    input1 = tf.constant([1.0, 1.0, 1.0, 1.0] * 100 * 100 * 100)
    input2 = tf.constant([2.0, 2.0, 2.0, 2.0] * 100 * 100 * 100)
    output = tf.add(input1, input2)
    result = output.eval()

    duration = time.time() - start_time
    print("TIME:", duration)

    print("result: ", result)

Am I doing something wrong here? Based on the log output, it seems to be using the GPU correctly.


I followed the steps in Can I measure the execution time of individual operations with TensorFlow? and captured an op-level timeline trace (screenshot omitted).
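
For reference, this is roughly the tracing approach from that answer (a minimal TF 1.x sketch using RunOptions and the timeline module; file name and tensor sizes are arbitrary):

import tensorflow as tf
from tensorflow.python.client import timeline

with tf.Session() as sess:
    x = tf.constant([1.0] * 1000000)
    y = tf.constant([2.0] * 1000000)
    z = tf.add(x, y)

    # Ask TensorFlow to record per-op execution metadata for this run
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(z, options=run_options, run_metadata=run_metadata)

    # Convert the step stats to Chrome's trace-event format;
    # open the resulting file at chrome://tracing to inspect per-op timings
    tl = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(tl.generate_chrome_trace_format())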

Upvotes: 0

Views: 324

Answers (1)

hobbs

Reputation: 239841

A GPU is an "external" processor; there's overhead involved in compiling a program for it, launching it, sending it data, and retrieving the results. GPUs also have different performance trade-offs from CPUs. While GPUs are frequently faster for large, complex number-crunching tasks, your "hello world" is too simple: it does very little with each data item between loading it and saving it (just a pairwise addition), and it doesn't do much work overall, since a few million additions is nothing. That makes the setup/teardown overhead dominate the measurement. So while the GPU is slower for this program, it's still likely to be faster for more substantial ones.
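
As a rough sketch of a fairer benchmark (sizes and repeat counts are arbitrary illustrative values): build the graph once, do a warm-up run to absorb the one-time setup costs, and then time a heavier op such as a large matrix multiplication, where the GPU should pull ahead:

import tensorflow as tf
import time

# A workload heavy enough for the GPU to pay off: multiplying two 4000x4000 matrices
a = tf.random_normal([4000, 4000])
b = tf.random_normal([4000, 4000])
c = tf.matmul(a, b)

with tf.Session() as sess:
    # Warm-up run: absorbs graph compilation and device initialization costs
    sess.run(c)

    # Time only the repeated execution, not the one-time setup
    start_time = time.time()
    for _ in range(10):
        sess.run(c)
    print("TIME per run:", (time.time() - start_time) / 10)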

Upvotes: 2
