AVarf

Reputation: 5169

How to train a model on multi gpus with tensorflow2 and keras?

I have an LSTM model that I want to train on multiple GPUs. I transformed the code to do this, and in nvidia-smi I can see that it is using all of the memory on all of the GPUs and that each GPU is at around 40% utilization, BUT the estimated training time per batch is almost the same as with a single GPU.

Can someone please guide me and tell me how I can train properly on multiple GPUs?

My code:

import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout

import os
from tensorflow.keras.callbacks import ModelCheckpoint



checkpoint_path = "./model/"
checkpoint_dir = os.path.dirname(checkpoint_path)
cp_callback = ModelCheckpoint(filepath=checkpoint_path, save_freq='epoch', verbose=1)

# NNET - LSTM
# Build and compile the model inside the strategy scope so that its
# variables are mirrored across all visible GPUs.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    regressor = Sequential()

    regressor.add(LSTM(units = 180, return_sequences = True, input_shape = (X_train.shape[1], 3)))
    regressor.add(Dropout(0.2))

    regressor.add(LSTM(units = 180, return_sequences = True))
    regressor.add(Dropout(0.2))

    regressor.add(LSTM(units = 180))
    regressor.add(Dropout(0.2))

    regressor.add(Dense(units = 4))

    regressor.compile(optimizer = 'adam', loss = 'mean_squared_error')

regressor.fit(X_train, y_train, epochs = 10, batch_size = 32, callbacks=[cp_callback])

Upvotes: 1

Views: 1364

Answers (2)

Srihari Humbarwadi

Reputation: 2632

Assume that your batch_size for a single GPU is N and the time taken per batch is X seconds.

You can measure training speed by measuring the time the model takes to converge, but you have to make sure that you feed in the right batch_size with 2 GPUs. Since 2 GPUs have twice the memory of a single GPU, you should linearly scale your batch_size to 2N. It might be deceiving to see that the model still takes X seconds per batch, but you should keep in mind that the model now sees 2N samples per batch, which leads to quicker convergence because you can now train with a higher learning rate.
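For example, a rough sketch of that scaling (N here is just a placeholder for your single-GPU batch size, and the commented fit call refers to the regressor/X_train from the question):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Assumed single-GPU batch size; scale it by the number of replicas so
# each GPU still processes N samples per step.
N = 32
global_batch_size = N * strategy.num_replicas_in_sync  # 2N on 2 GPUs

# regressor.fit(X_train, y_train, epochs=10, batch_size=global_batch_size)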

If both of your GPUs have their memory fully allocated but are sitting at around 40% utilization, there might be multiple reasons:

  • The model is too simple and you don't need all that compute.
  • Your batch_size is small and your GPUs can handle a bigger batch_size
  • Your CPU is the bottleneck, making the GPUs wait for the data to be ready; this can be the case when you see spikes in GPU utilization
  • You need to write a better, more performant data pipeline (see the sketch below). You can find more about efficient input pipelines here - https://www.tensorflow.org/guide/data_performance
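For instance, a minimal sketch of a tf.data input pipeline along the lines of that guide (X_train/y_train are assumed to be the arrays from the question):

import tensorflow as tf

def make_dataset(X, y, batch_size):
    ds = tf.data.Dataset.from_tensor_slices((X, y))
    ds = ds.shuffle(buffer_size=10_000)
    ds = ds.batch(batch_size)
    # Overlap data preparation on the CPU with training on the GPUs.
    ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
    return ds

# train_ds = make_dataset(X_train, y_train, batch_size=64)
# regressor.fit(train_ds, epochs=10, callbacks=[cp_callback])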

Upvotes: 2

Suraj Subbarao

Reputation: 89

You can try using CuDNNLSTM. It's way faster than the regular LSTM layer.

https://www.tensorflow.org/api_docs/python/tf/compat/v1/keras/layers/CuDNNLSTM
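For reference, a hedged sketch of what swapping the layers could look like (not from the answer itself; CuDNNLSTM only runs on a GPU, and the 60-timestep input shape is just an assumed example):

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

CuDNNLSTM = tf.compat.v1.keras.layers.CuDNNLSTM

regressor = Sequential()
# CuDNNLSTM uses the fused cuDNN kernel; it does not support
# recurrent_dropout or a custom activation.
regressor.add(CuDNNLSTM(units=180, return_sequences=True, input_shape=(60, 3)))
regressor.add(Dropout(0.2))
regressor.add(CuDNNLSTM(units=180))
regressor.add(Dropout(0.2))
regressor.add(Dense(units=4))
regressor.compile(optimizer='adam', loss='mean_squared_error')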

Upvotes: 0
