Oleg Dats

Reputation: 4133

Why do I get an out-of-memory exception while training a model on Google Cloud ML?

I followed this tutorial to train a TensorFlow 1.3 object detection model. I want to retrain the faster_rcnn_resnet101_coco or faster_rcnn_inception_resnet_v2_atrous_coco model on my small dataset (1 class, ~100 examples) on Google Cloud. I changed the number of classes and PATH_TO_BE_CONFIGURED in the relevant config files, as suggested in the tutorial.

Dataset: 12 images, 4032 × 3024, 10-20 labeled bounding boxes per image.

Why do I get an out-of-memory exception?

The replica master 0 ran out-of-memory and exited with a non-zero status of 247.

Please note that I tried different configurations:

  1. scale-tier BASIC_GPU
  2. default config yaml
  3. a customized yaml using instances with more memory

    trainingInput:
      runtimeVersion: "1.0"
      scaleTier: CUSTOM
      masterType: complex_model_l
      workerCount: 7
      workerType: complex_model_s
      parameterServerCount: 3
      parameterServerType: standard
    

Upvotes: 2

Views: 765

Answers (2)

Hafizur Rahman

Reputation: 2370

If you are working on a large dataset, I'd strongly recommend using "large_model" in your config file (config.yaml), and choosing a recent stable version of TensorFlow by setting runtimeVersion to "1.4". You have specified "1.0", which causes ML Engine to select TensorFlow 1.0. For more information, refer to Runtime Version, which says:

"You can specify a supported Cloud ML Engine runtime version to use for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances. Unless you have a compelling reason to, you should let the training service use its default version, which is always the latest stable version."

Hence, I recommend the following configuration to be used:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: large_model
  workerCount: 7
  workerType: complex_model_l
  parameterServerCount: 3
  parameterServerType: standard

In the above config,

masterType: large_model

allows you to choose a machine with a lot of memory, especially suited for parameter servers when your model is too large (many hidden layers, or layers with very large numbers of nodes). Hope it helps.

Upvotes: 2

Derek Chow

Reputation: 732

Could you describe your dataset? In my experience, when users run into OOM problems it's typically because the images in their dataset are high resolution. Prescaling the images down to a smaller size tends to help with memory issues.
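To illustrate why this matters: a single 4032 × 3024 RGB image decoded to float32 is about 146 MB (4032 × 3024 × 3 × 4 bytes) before any feature maps are computed, while the Faster R-CNN configs typically resize inputs to roughly the 600-1024 px range. A minimal sketch (the `downscale` helper name is hypothetical, not part of the Object Detection API) of the scaling math you'd apply to images and their pixel-coordinate bounding boxes before regenerating the TFRecords:

```python
def downscale(width, height, boxes, max_dim=1024):
    """Scale (width, height) so the longer side is max_dim, and rescale
    [xmin, ymin, xmax, ymax] pixel boxes by the same factor."""
    factor = max_dim / max(width, height)
    if factor >= 1.0:  # already small enough, leave untouched
        return (width, height), boxes
    new_size = (round(width * factor), round(height * factor))
    scaled = [[round(c * factor) for c in box] for box in boxes]
    return new_size, scaled

# A 4032x3024 image with one labeled box shrinks to 1024x768:
size, boxes = downscale(4032, 3024, [[100, 200, 900, 1200]])
print(size, boxes)  # (1024, 768) [[25, 51, 229, 305]]
```

The actual pixel resizing can then be done with any image library (e.g. Pillow's `Image.resize`) using the returned size, and the scaled boxes written into the regenerated TFRecord dataset.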

Upvotes: 1
