Nemtudom

Reputation: 356

Why are TensorFlow models trained on Google Cloud ML more accurate than models trained locally?

I trained an Object Detection API model (Mask R-CNN with Inception v2, COCO-pretrained, from the Model Zoo) with identical configs, identical TensorFlow and model versions, and the same (custom) dataset, for the same number of steps.

On the local machine (tensorflow-gpu on a GTX 1080 Ti) I used object_detection/train.py, while on the cloud I submitted a Google ML Engine job running the object_detection.train module. Both used the same learning rate.

The cloud run used 5 workers, while the local one had only 1 GPU. They were both set to a batch size of 1.
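For reference, the two invocations looked roughly like this (a minimal sketch of the commands; the job name, bucket, region, and paths are placeholders, not my actual values):

    # Local: legacy Object Detection API training script on the single GPU.
    python object_detection/train.py \
        --logtostderr \
        --pipeline_config_path=path/to/mask_rcnn_inception_v2_coco.config \
        --train_dir=path/to/train_dir

    # Cloud: the same module submitted as an ML Engine training job
    # (the 5 workers are configured in the job's config file).
    gcloud ml-engine jobs submit training my_object_detection_job \
        --job-dir=gs://my-bucket/train \
        --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
        --module-name object_detection.train \
        --region us-central1 \
        --config cloud.yml \
        -- \
        --train_dir=gs://my-bucket/train \
        --pipeline_config_path=gs://my-bucket/data/mask_rcnn_inception_v2_coco.config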

Why does the locally trained model perform so much worse? It tends to produce significantly more false positives than its cloud-trained counterpart.

More importantly, what can I do to bring my training on the local machine up to par with the cloud?

Upvotes: 0

Views: 435

Answers (1)

Lak

Reputation: 4166

It looks like you used 5 workers on the cloud, whereas you used only one GPU locally. In that case, the effective batch size is different.

The effective batch size is the batch size you set on the command line divided by the number of workers, and it looks like lower batch sizes work well for your model. So, to bring your local training accuracy up, reduce the local batch size to 1/5 of the value you used.

Also, if the difference is significant enough that you can noticeably tell the cloud model is better, then perhaps you should do hyperparameter tuning to find better parameters. Do this on the BASIC_GPU scale tier so that settings that work on the cloud will also work locally.
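As a sketch of that last suggestion (the job name, bucket, and paths are placeholders): submitting the training job on the BASIC_GPU scale tier gives you a single worker with one GPU, so whatever settings you find there should transfer to a single-GPU local machine.

    gcloud ml-engine jobs submit training single_gpu_job \
        --job-dir=gs://my-bucket/train_single_gpu \
        --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
        --module-name object_detection.train \
        --region us-central1 \
        --scale-tier BASIC_GPU \
        -- \
        --train_dir=gs://my-bucket/train_single_gpu \
        --pipeline_config_path=gs://my-bucket/data/mask_rcnn_inception_v2_coco.config

The hyperparameter search itself is configured through the hyperparameters section of the job's config file, as described in the Cloud ML Engine documentation.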

Upvotes: 1
