Minoru

Reputation: 1730

CUDA_ERROR_OUT_OF_MEMORY on Tensorflow#object_detection/train.py

I'm running the Tensorflow Object Detection API to train my own detector using the object_detection/train.py script, found here. The problem is that I'm getting CUDA_ERROR_OUT_OF_MEMORY constantly.

I found some suggestions to reduce the batch size so the trainer consumes less memory, but reducing it from 16 to 4 still gives me the same error. The only difference is that with batch_size=16 the error was thrown at around step 18, and now it is thrown at around step 70. EDIT: setting batch_size=1 didn't solve the problem either, as I still got the error at step ~2700.

What can I do to make it run smoothly until I stop the training process? I don't really need fast training.

EDIT: I'm currently using a GTX 750 Ti 2GB for this. The GPU is not being used for anything other than training and driving the monitor. Currently, I'm using only 80 images for training and 20 images for evaluation.

Upvotes: 2

Views: 1366

Answers (3)

DV82XL

Reputation: 6629

Another option is to dedicate the GPU for training and use the CPU for evaluation.

  • Disadvantage: Evaluation will consume a large portion of your CPU, but only for a few seconds each time a training checkpoint is created, which is not often.
  • Advantage: 100% of your GPU is used for training all the time.

To target CPU, set this environment variable before you run the evaluation script:

export CUDA_VISIBLE_DEVICES=-1
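
If you prefer to do this from inside the evaluation script instead of the shell, a minimal sketch would be the following (assuming you can edit the script and set the variable before TensorFlow is imported):

import os

# Hide all GPUs so TensorFlow falls back to the CPU for evaluation.
# This must be set before TensorFlow is imported to take effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf  # imported only after the variable is set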

You can also explicitly set the evaluation batch size to 1 in pipeline.config so it consumes less memory:

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1
}

If you're still having issues, TensorFlow may not be releasing GPU memory between training runs. Try restarting your terminal or IDE and try again. This answer has more details.

Upvotes: 0

Minoru

Reputation: 1730

Found the solution to my problem. The batch_size itself was not the issue, but a higher batch_size made the training memory consumption grow faster, because I was using the config.gpu_options.allow_growth = True configuration. This setting allows Tensorflow to increase memory consumption when needed, and it tries to use up to 100% of the GPU memory.

The problem was that I was running the eval.py script at the same time (as recommended in their tutorial) and it was using part of the GPU memory. When the train.py script tried to use all 100%, the error was thrown.

I solved it by setting the maximum memory usage fraction to 70% for the training process. This also solved the stuttering while training. It may not be the optimal value for my GPU, but it is configurable using the config.gpu_options.per_process_gpu_memory_fraction = 0.7 setting, for example.
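
For reference, both options are set on the session configuration object. A minimal TF 1.x sketch is shown below; where exactly train.py builds this config is internal to the Object Detection API, so this is only an illustration:

import tensorflow as tf  # TF 1.x API

config = tf.ConfigProto()
# Cap this process at roughly 70% of GPU memory instead of letting it grow to 100%.
config.gpu_options.per_process_gpu_memory_fraction = 0.7
# Alternatively, allow_growth lets usage grow on demand (the behaviour described above):
# config.gpu_options.allow_growth = True
sess = tf.Session(config=config)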

Upvotes: 0

scott huang

Reputation: 2678

I think it is not about batch_size, because you were able to start the training in the first place.

Open a terminal and run

nvidia-smi -l

to check whether other processes kick in when this error happens. If you set batch_size=16, you can find out pretty quickly.

Upvotes: 1
