Reputation: 21
I have built a semantic segmentation Keras (TensorFlow backend) model and am trying to train it on Google Cloud ML Engine. I have around 200,000 (256x256) images to train on in small batches (10) for around 100 epochs. One epoch was taking almost 25 hours when I used just a master device of type complex_model_m_gpu.
I am not sure how Keras models adapt to multi-GPU training devices (e.g. complex_model_m_gpu). There is no documentation on this, only on distributed TensorFlow training. How can I best use the resources available on ML Engine to train my model quickly? How does using multiple workers affect the training process? When I add workers to my stack, the master and each worker run the same epoch independently of each other, and they save different checkpoints. This seems counterproductive.
Upvotes: 2
Views: 879
Reputation: 8399
Leveraging more than one GPU requires some modification to your code. Here's one tutorial that you may find helpful. Notice the following lines of code:
# multi_gpu_model ships with Keras (keras.utils); MiniGoogLeNet is the
# model class defined in the linked tutorial, and G is the GPU count
import tensorflow as tf
from keras.utils import multi_gpu_model

# we'll store a copy of the model on *every* GPU and then combine
# the results from the gradient updates on the CPU
with tf.device("/cpu:0"):
    # initialize the model
    model = MiniGoogLeNet.build(width=32, height=32, depth=3,
        classes=10)

# make the model parallel across the G GPUs
model = multi_gpu_model(model, gpus=G)
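After that, the parallel model is compiled and trained like any other Keras model. As a rough sketch (x_train, y_train and the hyperparameters here are placeholders, not from the tutorial), the batch you pass in is split across the G GPUs:

# hypothetical usage: compile and fit the parallel model as usual;
# batch_size is the combined batch that gets split across the GPUs
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.fit(x_train, y_train, batch_size=10 * G, epochs=100)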
It's generally much more performant to use 1 machine with 1/2/4/8 GPUs rather than multiple machines. However, if you want to scale beyond the number of GPUs in a single machine, convert your model with model_to_estimator and invoke train_and_evaluate on the resulting Estimator. Keras is not multi-machine aware, so if you don't do that, each worker will try to run independently, as you observed.
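A minimal sketch of that conversion might look like the following; train_input_fn, eval_input_fn, max_steps and the GCS model_dir are placeholders you would replace with your own input pipeline and bucket:

import tensorflow as tf

# assume `model` is your compiled Keras segmentation model
estimator = tf.keras.estimator.model_to_estimator(
    keras_model=model,
    model_dir="gs://your-bucket/model-dir")  # placeholder GCS path

# train_input_fn / eval_input_fn should return a tf.data.Dataset
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

# on ML Engine, TF_CONFIG is set for you, so this call coordinates the
# master and workers instead of each running its own training loop
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)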
Upvotes: 2