AmirHJ
AmirHJ

Reputation: 837

tensorflow: CUDA_ERROR_OUT_OF_MEMORY always happen

I'm going to train a seq2seq model using tf-seq2seq package by 1080 ti (11GB) GPU. I always get the following error using different network's size (even nmt_small):

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:03:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:03:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 10.91G (11715084288 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 12337 get requests, put_count=10124 evicted_count=1000 eviction_rate=0.0987752 and unsatisfied allocation rate=0.268542
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Saving checkpoints for 1 into ../model/model.ckpt.
INFO:tensorflow:step = 1, loss = 5.07399

It seems that tensorflow try to occupy the total amount of the GPU's memory (10.91GiB) but clearly only 10.75GiB is available.

Upvotes: 0

Views: 5922

Answers (2)

ml4294
ml4294

Reputation: 2629

In addition to both of the suggestions made concerning the memory growth, you can also try:

sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.90

with tf.Session(config=sess_config) as sess:
   ...

With this you can limit the amount of GPU memory allocated by the program, in this case to 90 percent of the available GPU memory. Maybe this is sufficient to solve your problem of the network trying to allocate more memory than available. If this is not sufficient, you will have to decrease the batch size or the network's size.

Upvotes: 1

Ali
Ali

Reputation: 974

you should notice some tips:

1- use memory growth, from tensorflow document: "in some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process. TensorFlow provides two Config options on the Session to control this."

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)

2- are you use batch to training? or feed whole data at once? if yes, then decrease your batch size

Upvotes: 2

Related Questions