Nikhil Saraf

Reputation: 21

Keras model training on ML engine with multiple workers

I have built a semantic segmentation Keras (TensorFlow backend) model and am trying to train it on Google Cloud ML Engine. I have around 200,000 (256x256) images to train on in small batches (10) for around 100 epochs. One epoch was taking almost 25 hours when I used just a master device of type complex_model_m_gpu.

I am not sure how Keras models adapt to multi-GPU training devices (e.g. complex_model_m_gpu). There is no documentation on this, only on distributed TensorFlow training. How can I best use the resources available on ML Engine to train my model quickly? How does using multiple workers affect the training process? When I add workers to my stack, the master and the worker each run the epoch independently of each other, and they save different checkpoints. This seems counterproductive.

Upvotes: 2

Views: 879

Answers (1)

rhaertel80

Reputation: 8399

Leveraging more than 1 GPU takes some modification to your code. Here's one tutorial that you may find helpful. Notice the following lines of code:

# imports needed for this snippet (MiniGoogLeNet is the example network
# defined in the linked tutorial; G is the number of GPUs on the machine)
import tensorflow as tf
from keras.utils import multi_gpu_model

G = 4  # e.g. a machine with 4 GPUs

# we'll store a copy of the model on *every* GPU and then combine
# the results from the gradient updates on the CPU
with tf.device("/cpu:0"):
    # initialize the model
    model = MiniGoogLeNet.build(width=32, height=32, depth=3,
                                classes=10)

# make the model parallel
model = multi_gpu_model(model, gpus=G)

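For completeness, here is a minimal sketch of how the parallel model might then be compiled and trained. The optimizer, loss, and the x_train / y_train arrays are placeholders rather than anything from the tutorial, and the batch size is scaled by G because multi_gpu_model splits each incoming batch across the replicas:

# sketch only: compile and fit the multi-GPU model
# (optimizer, loss, and x_train / y_train are placeholders)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# multi_gpu_model splits each batch across the G replicas, so scaling the
# batch size by G keeps the per-GPU batch at 10
model.fit(x_train, y_train, batch_size=10 * G, epochs=100)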
It's generally much more performant to use 1 machine with 1/2/4/8 GPUs rather than using multiple machines. However, if you want to scale beyond the number of GPUs in a single machine, use model_to_estimator to convert your Keras model and invoke train_and_evaluate on the resulting Estimator. Keras is not multi-machine aware, so if you don't do that, each worker will try to run independently, as you observed.
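As a rough illustration of that route (untested; the input functions, model directory, and step count below are placeholders, not part of any specific API contract beyond the Estimator calls shown):

import tensorflow as tf

# convert the compiled Keras model into an Estimator; model_dir should be a
# GCS path so the master and all workers share the same checkpoints
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model,
    model_dir="gs://your-bucket/segmentation-model")

# train_input_fn / eval_input_fn are placeholders for functions returning
# tf.data.Dataset objects of (image, mask) pairs;
# 200,000 images / batch 10 * 100 epochs gives roughly 2,000,000 steps
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn,
                                    max_steps=2000000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

# on ML Engine, each replica reads its role from the TF_CONFIG environment
# variable, so the same call coordinates the master and the workers
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)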

Upvotes: 2
