NotAName

Reputation: 4322

Not seeing performance improvement when running TensorFlow on GPU

I installed CUDA and cuDNN as per the instructions on the TF help page, and it appears that everything is working correctly. If I print the available GPUs I get:

>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Out: Num GPUs Available:  1

Also, when I start training the Sequential model, the output shows that all necessary libraries have loaded correctly and that a GPU device was successfully created:

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4733 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)

But I'm not seeing any major improvement in training performance. It's about the same as it was before when training on the CPU, and I'd assume my RTX 3060 should provide a bit of a boost.

Should I be seeing an improvement when training a relatively simple Sequential model?

EDIT: If I disable GPU training and train on CPU only using:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

The training time of the model on the CPU is 21.14 seconds, while on the GPU training takes 57.59(!!!) seconds.

I also don't see the GPU load increase as expected during training:

(screenshot: GPU utilization stays low during training)
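One quick way to confirm whether ops are actually being placed on the GPU (not part of my original code; this assumes a TF 2.x eager setup) is to enable device placement logging before building the model:

import tensorflow as tf

# Print the device each op is assigned to; call this before creating the model
tf.debugging.set_log_device_placement(True)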

Here is also the code for the model I'm training:

import datetime as dt
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import tensorflow as tf
from tensorflow import keras
import numpy as np

EPOCHS = 50
BATCH_SIZE = 128
VERBOSE = 1
NB_CLASSES = 10  # Number of outputs
N_HIDDEN = 128
VALIDATION_SPLIT = 0.2
DROPOUT = 0.3

mnist = keras.datasets.mnist
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
# X_train is 60,000 rows of 28x28 values
# Reshape it to 60,000x784
RESHAPED = 784

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize inputs between 0 and 1
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# One-hot encoding of labels
Y_train = tf.keras.utils.to_categorical(Y_train, NB_CLASSES)
Y_test = tf.keras.utils.to_categorical(Y_test, NB_CLASSES)

# Build the model
model = tf.keras.models.Sequential()
model.add(keras.layers.Dense(N_HIDDEN, input_shape=(RESHAPED,),
          name='dense_layer', activation='relu'))
model.add(keras.layers.Dropout(DROPOUT))
model.add(keras.layers.Dense(N_HIDDEN,
          name='dense_layer2', activation='relu'))
model.add(keras.layers.Dropout(DROPOUT))
model.add(keras.layers.Dense(NB_CLASSES,
          name='dense_layer3', activation='softmax'))

# Print summary of the model
model.summary()

# Compiling the model
model.compile(optimizer='Adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

t = dt.datetime.now()
# Training the model
model.fit(X_train, Y_train, batch_size=BATCH_SIZE,
          epochs=EPOCHS, verbose=VERBOSE,
          validation_split=VALIDATION_SPLIT)
# Report elapsed time before evaluation so only training is measured
print(f'Training elapsed: {dt.datetime.now()-t}')

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, Y_test)
print('Test accuracy: ', test_acc)

Upvotes: 1

Views: 568

Answers (1)

NotAName

Reputation: 4322

I'll just put an answer here in case it's useful to anyone in the future. From the information provided in the comments, and also the answer to this post, the slowness appears to be the result of a combination of a couple of factors.

For starters, on small matrices matrix multiplication on the CPU is significantly faster due to higher clock speeds. Secondly, there is significant overhead in transferring data between the CPU and GPU, and on smaller inputs any performance gains from GPU processing are eaten up by that overhead.
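To illustrate the point, here's a minimal sketch (not from my original code; it assumes TF 2.x eager mode and a visible GPU) that times a small and a large matrix multiplication on each device:

import time
import tensorflow as tf

def time_matmul(device, size, iters=100):
    # Place both the tensors and the matmul on the requested device
    with tf.device(device):
        a = tf.random.uniform((size, size))
        b = tf.random.uniform((size, size))
        _ = tf.matmul(a, b)  # warm-up, excludes one-time setup cost
        start = time.perf_counter()
        for _ in range(iters):
            c = tf.matmul(a, b)
        _ = c.numpy()  # force a device sync so GPU work is fully timed
        return time.perf_counter() - start

for size in (128, 4096):
    print(f"size={size:5d}  CPU: {time_matmul('/CPU:0', size):.3f}s  "
          f"GPU: {time_matmul('/GPU:0', size):.3f}s")

On a setup like mine, the small matmul tends to favour the CPU while the large one favours the GPU.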

As a result, on the MNIST dataset, where the input has shape (784,), the processing times are as follows:

CPU - 21s

GPU - 57s

At the same time, on the IMDB dataset, where the input has shape (10000,), the gains from GPU processing are significant:

CPU - 4min 40s

GPU - 1min 23s

So for small inputs it's best to disable GPU processing for faster model fitting, using something like:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
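Alternatively (a sketch assuming TF 2.x; the environment-variable approach above works just as well), the GPU can be hidden through the tf.config API, as long as this runs before any op initializes the GPU:

import tensorflow as tf

# Hide all GPUs from TensorFlow; must run before the GPU is first used
tf.config.set_visible_devices([], 'GPU')
print(tf.config.get_visible_devices('GPU'))  # -> []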

Upvotes: 1
