Oleg Dats

Reputation: 4133

Why do I get an out-of-memory exception while training a model on Google Cloud ML?

I followed this tutorial to train a TensorFlow 1.3 object detection model. I want to retrain the faster_rcnn_resnet101_coco or faster_rcnn_inception_resnet_v2_atrous_coco model on my small dataset (1 class, ~100 examples) on Google Cloud. I changed the number of classes and PATH_TO_BE_CONFIGURED in the relevant config files, as suggested in the tutorial.

Dataset: 12 images, 4032 × 3024, 10-20 labeled bounding boxes per image.

Why do I get an out-of-memory exception?

The replica master 0 ran out-of-memory and exited with a non-zero status of 247.

Please note that I tried different configurations:

  1. scale-tier BASIC_GPU
  2. default config yaml
  3. a customized yaml using instances with more memory

    trainingInput:
      runtimeVersion: "1.0"
      scaleTier: CUSTOM
      masterType: complex_model_l
      workerCount: 7
      workerType: complex_model_s
      parameterServerCount: 3
      parameterServerType: standard
    

Upvotes: 2

Views: 765

Answers (2)

Hafizur Rahman

Reputation: 2370

If you are working on a large dataset, I'd strongly recommend using "large_model" in your config file (config.yaml), and choosing a recent stable version of TensorFlow by setting runtimeVersion to "1.4". You have specified "1.0", which causes ML Engine to select TensorFlow 1.0. For more information, refer to Runtime Version, which says:

"You can specify a supported Cloud ML Engine runtime version to use for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances. Unless you have a compelling reason to, you should let the training service use its default version, which is always the latest stable version."

Hence, I recommend the following configuration to be used:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: large_model
  workerCount: 7
  workerType: complex_model_l
  parameterServerCount: 3
  parameterServerType: standard

In the above config,

masterType: large_model

allows you to choose a machine with a lot of memory, especially suited for parameter servers when your model is too large (many hidden layers, or layers with very large numbers of nodes). Hope it helps.

Upvotes: 2

Derek Chow

Reputation: 732

Could you describe your dataset? In my experience, when users run into OOM problems it's typically because the images in their dataset are high resolution. Prescaling the images down to a smaller size tends to help with memory issues.
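To illustrate why this matters: a single 4032 × 3024 RGB image decoded to float32 is about 146 MB (4032 × 3024 × 3 × 4 bytes) before any feature maps are computed, while the Faster R-CNN configs typically resize inputs to roughly the 600-1024 px range. A minimal sketch (the `downscale` helper name is hypothetical, not part of the Object Detection API) of the scaling math you'd apply to images and their pixel-coordinate bounding boxes before regenerating the TFRecords:

```python
def downscale(width, height, boxes, max_dim=1024):
    """Scale (width, height) so the longer side is max_dim, and rescale
    [xmin, ymin, xmax, ymax] pixel boxes by the same factor."""
    factor = max_dim / max(width, height)
    if factor >= 1.0:  # already small enough, leave untouched
        return (width, height), boxes
    new_size = (round(width * factor), round(height * factor))
    scaled = [[round(c * factor) for c in box] for box in boxes]
    return new_size, scaled

# A 4032x3024 image with one labeled box shrinks to 1024x768:
size, boxes = downscale(4032, 3024, [[100, 200, 900, 1200]])
print(size, boxes)  # (1024, 768) [[25, 51, 229, 305]]
```

The actual pixel resizing can then be done with any image library (e.g. Pillow's `Image.resize`) using the returned size, and the scaled boxes written into the regenerated TFRecord dataset.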

Upvotes: 1
