Reputation: 21
I have built a semantic segmentation Keras (TensorFlow backend) model and am trying to train it on Google Cloud ML Engine. I have around 200,000 (256x256) images to train on in small batches (10) for around 100 epochs. One epoch was taking almost 25 hours when I used just a master device of type complex_model_m_gpu.
I am not sure how Keras models adapt to multi-GPU training devices (e.g. complex_model_m_gpu). There is no documentation on this, only on distributed TensorFlow training. How can I best use the resources available on ML Engine to train my model quickly? How does using multiple workers affect the training process? When I add workers to my stack, the master and each worker run the same epoch independently of each other, and they save different checkpoints. This seems counterproductive.
Upvotes: 2
Views: 879
Reputation: 8399
Leveraging more than one GPU requires some modification to your code. Here's one tutorial that you may find helpful. Notice the following lines of code:
# multi_gpu_model ships with Keras (keras.utils); MiniGoogLeNet is the
# model class defined in the linked tutorial, and G is the GPU count
import tensorflow as tf
from keras.utils import multi_gpu_model

# we'll store a copy of the model on *every* GPU and then combine
# the results from the gradient updates on the CPU
with tf.device("/cpu:0"):
    # initialize the model
    model = MiniGoogLeNet.build(width=32, height=32, depth=3,
        classes=10)

# make the model parallel across the G GPUs
model = multi_gpu_model(model, gpus=G)
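After that, the parallel model is compiled and trained like any other Keras model. As a rough sketch (x_train, y_train and the hyperparameters here are placeholders, not from the tutorial), the batch you pass in is split across the G GPUs:

# hypothetical usage: compile and fit the parallel model as usual;
# batch_size is the combined batch that gets split across the GPUs
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(x_train, y_train, batch_size=10 * G, epochs=100)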
It's generally much more performant to use 1 machine with 1/2/4/8 GPUs rather than multiple machines. However, if you want to scale beyond the number of GPUs in a single machine, convert your model with model_to_estimator and invoke train_and_evaluate on the resulting Estimator. Keras is not multi-machine aware, so if you don't do that, each worker will try to run independently, as you observed.
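A minimal sketch of that conversion might look like the following; train_input_fn, eval_input_fn, max_steps and the GCS model_dir are placeholders you would replace with your own input pipeline and bucket:

import tensorflow as tf

# assume `model` is your compiled Keras segmentation model
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model,
    model_dir="gs://your-bucket/model-dir")  # placeholder GCS path

# train_input_fn / eval_input_fn should return a tf.data.Dataset
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

# on ML Engine, TF_CONFIG is set for you, so this call coordinates the
# master and workers instead of each running its own training loop
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)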
Upvotes: 2