Reputation: 4133
I followed this tutorial to train a TensorFlow 1.3 object detection model. I want to retrain the faster_rcnn_resnet101_coco or faster_rcnn_inception_resnet_v2_atrous_coco model with my small data set (1 class, ~100 examples) on Google Cloud. I have changed the number of classes and the PATH_TO_BE_CONFIGURED entries in the relevant config files as suggested in the tutorial.
Dataset: 12 images, 4032 × 3024, 10-20 labeled bounding boxes per image.
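Roughly, my edits in the pipeline config look like this. This is only a sketch showing the fields I touched, following the sample faster_rcnn configs that ship with the Object Detection API; the file names after PATH_TO_BE_CONFIGURED are illustrative, and in my actual config the placeholders are replaced with my GCS paths:

model {
  faster_rcnn {
    num_classes: 1    # was 90 in the faster_rcnn_resnet101_coco sample
  }
}
train_config {
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
}
train_input_reader {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/train.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/label_map.pbtxt"
}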
Why do I get an out-of-memory exception?
The replica master 0 ran out-of-memory and exited with a non-zero status of 247.
Please note that I tried different configurations, including a customized YAML file to use instances with more memory:
trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: complex_model_l
  workerCount: 7
  workerType: complex_model_s
  parameterServerCount: 3
  parameterServerType: standard
Upvotes: 2
Views: 765
Reputation: 2370
If you are working on a large dataset, I'd strongly recommend using "large_model" in your config file (config.yaml), and you should choose a recent stable version of TensorFlow by setting the runtimeVersion to "1.4". You have chosen "1.0", which causes ML Engine to install TensorFlow 1.0. For more information, please refer to Runtime Version, which says:
"You can specify a supported Cloud ML Engine runtime version to use for your training job. The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances. Unless you have a compelling reason to, you should let the training service use its default version, which is always the latest stable version."
Hence, I recommend using the following configuration:
trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: large_model
  workerCount: 7
  workerType: complex_model_l
  parameterServerCount: 3
  parameterServerType: standard
In the above config, masterType: large_model lets you choose a machine with a lot of memory, especially suited for parameter servers when your model is too large (i.e. many hidden layers or layers with very large numbers of nodes). Hope it helps.
Upvotes: 2
Reputation: 732
Could you describe your dataset? In my experience, when users run into OOM problems it's typically because the images in their dataset are high resolution. Pre-scaling the images down to a smaller size tends to help with memory issues.
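For reference, here is a sketch of the resizing knob in the same pipeline config, assuming the keep_aspect_ratio_resizer block that the sample faster_rcnn configs ship with (the values below are the sample defaults). Note that the original 4032 x 3024 JPEGs are still decoded at full resolution in the input pipeline, so resizing the image files themselves before generating the TFRecords usually saves more memory than only lowering these numbers:

image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600     # sample default; lowering both values reduces memory use
    max_dimension: 1024
  }
}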
Upvotes: 1