Reputation: 167
I have a PC with the following specs:
My question: when I run my Keras training program on roughly 60k images (on GPU:1), the program loads the images into a data matrix of 12922.20 MB.
After that, it sits idle for about a minute and is then killed automatically. The same code trains fine on GPU:1 with 10k images.
I searched online and on SO, but I couldn't find or understand much about how GPU memory is allocated, and how it scales, when using multiple GPUs with Keras.
Any help would be appreciated!
Upvotes: 2
Views: 1723
Reputation: 15033
I would first recommend that you check the memory usage when training on a single GPU; I suspect that your dataset is not being loaded into GPU memory but into system RAM. A ~12.9 GB matrix (plus any copies made during preprocessing) can easily exhaust RAM, in which case the OS out-of-memory killer terminates the process, which matches the behaviour you describe.
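As a minimal sketch of how you could check both from inside the script (this assumes TensorFlow 2.5+ for tf.config.experimental.get_memory_info, the third-party psutil package, and at least one visible GPU):

import os
import psutil
import tensorflow as tf

# Resident RAM used by this Python process, in MB
rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024**2
print(f"Process RAM usage: {rss_mb:.1f} MB")

# Current/peak memory TensorFlow has allocated on the first visible GPU
info = tf.config.experimental.get_memory_info("GPU:0")
print(f"GPU current: {info['current'] / 1024**2:.1f} MB, "
      f"peak: {info['peak'] / 1024**2:.1f} MB")

If the RAM number is near your machine's total while the GPU number stays low, the data is living in RAM, not on the card.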
You can first restrict the process to a single card:

import os
# Must be set before TensorFlow is imported, so the process
# only sees one of the video cards
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "1"
Then check the exact mapping (i.e. which GPUs TensorFlow actually sees):

import tensorflow as tf
tf.config.list_physical_devices('GPU')
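With CUDA_VISIBLE_DEVICES restricted to a single card, this should return a one-element list along the lines of [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]; note that TensorFlow renumbers the visible devices from 0 regardless of the physical index you exposed.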
Then, in a terminal, you can use nvidia-smi to check how much GPU memory has been allocated; to monitor it continuously while training, use watch -n K nvidia-smi, where K is the refresh interval in seconds (e.g. watch -n 1 nvidia-smi).
When you use multiple GPUs, ensure that you use tf.distribute.MirroredStrategy() and declare your model creation + compile logic inside the strategy scope, like below:
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.
    model = Model(...)
    model.compile(...)

# `fit()` and `evaluate()` can run outside the scope.
model.fit(train_dataset, validation_data=val_dataset, ...)
model.evaluate(test_dataset)
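For completeness, here is a minimal self-contained sketch of that pattern; the toy MNIST model and the hyper-parameters are my own placeholders, not anything from your setup:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

with strategy.scope():
    # Variable creation (model construction + compile) goes under the scope
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'],
    )

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# fit()/evaluate() split each batch across the replicas automatically
model.fit(x_train, y_train, batch_size=256, epochs=1)
model.evaluate(x_test, y_test)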
Upvotes: 1