Reputation: 2011
When I started training a neural network, it hit CUDA_ERROR_OUT_OF_MEMORY, but the training went on without error. Because I wanted to use only as much GPU memory as actually needed, I set gpu_options.allow_growth = True. The logs are as follows:
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device:0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.00G (4294967296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Iter 20, Minibatch Loss= 40491.636719
...
Then I ran the nvidia-smi command, which showed:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27                 Driver Version: 367.27                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0     Off |                  N/A |
| 40%   61C    P2    46W / 180W |   8107MiB /  8111MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   40C    P0    40W / 180W |      0MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     22932    C   python                                        8105MiB |
+-----------------------------------------------------------------------------+
After I commented out gpu_options.allow_growth = True, I trained the net again and everything was normal; there was no CUDA_ERROR_OUT_OF_MEMORY. Finally, running the nvidia-smi command gave:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27                 Driver Version: 367.27                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0     Off |                  N/A |
| 40%   61C    P2    46W / 180W |   7793MiB /  8111MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   40C    P0    40W / 180W |      0MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     22932    C   python                                        7791MiB |
+-----------------------------------------------------------------------------+
I have two questions about this. Why did CUDA_ERROR_OUT_OF_MEMORY come out while the training still went on normally? And why did the memory usage become smaller after commenting out allow_growth = True?
Upvotes: 53
Views: 107545
Reputation: 307
Check the correctness of the input dataset. An empty (null) input list can cause this error too. That was the situation I faced in Colab with tf.keras.
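As a rough illustration (the file names below are hypothetical; substitute your own data loading), a quick sanity check before calling fit can rule this out:
import numpy as np

# Hypothetical example data; replace with your own loading code.
x_train = np.load("x_train.npy")
y_train = np.load("y_train.npy")

# An empty or mismatched dataset can surface as confusing GPU-side errors,
# so fail early with a clear message instead.
assert len(x_train) > 0, "x_train is empty"
assert len(x_train) == len(y_train), "x_train / y_train length mismatch"

# model.fit(x_train, y_train, ...)  # proceed only if the checks pass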
Upvotes: 0
Reputation: 1214
I faced this issue when trying to train models back to back. I figured the GPU memory wasn't available because of the previous training run, so the easiest fix is to manually flush the GPU memory before the next training.
Use nvidia-smi to check the GPU memory usage, then reset the GPU:
nvidia-smi
nvidia-smi --gpu-reset
The above command may not work if other processes are actively using the GPU.
Alternatively, you can use the following command to list all the processes that are using the GPU:
sudo fuser -v /dev/nvidia*
And the output should look like this:
USER PID ACCESS COMMAND
/dev/nvidia0: root 2216 F...m Xorg
sid 6114 F...m krunner
sid 6116 F...m plasmashell
sid 7227 F...m akonadi_archive
sid 7239 F...m akonadi_mailfil
sid 7249 F...m akonadi_sendlat
sid 18120 F...m chrome
sid 18163 F...m chrome
sid 24154 F...m code
/dev/nvidiactl: root 2216 F...m Xorg
sid 6114 F...m krunner
sid 6116 F...m plasmashell
sid 7227 F...m akonadi_archive
sid 7239 F...m akonadi_mailfil
sid 7249 F...m akonadi_sendlat
sid 18120 F...m chrome
sid 18163 F...m chrome
sid 24154 F...m code
/dev/nvidia-modeset: root 2216 F.... Xorg
sid 6114 F.... krunner
sid 6116 F.... plasmashell
sid 7227 F.... akonadi_archive
sid 7239 F.... akonadi_mailfil
sid 7249 F.... akonadi_sendlat
sid 18120 F.... chrome
sid 18163 F.... chrome
sid 24154 F.... code
From here, I got the PID of the process that was holding the GPU memory, which in my case was 24154.
Use the following command to kill the process by its PID:
sudo kill -9 MY_PID
Replace MY_PID with the relevant PID.
Upvotes: 18
Reputation: 8253
Environment:
1. CUDA 10.0
2. cuDNN 10.0
3. tensorflow 1.14.0
4. pip install opencv-contrib-python
5. git clone https://github.com/thtrieu/darkflow
6. Allowing GPU memory growth (a minimal sketch follows below)
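For reference, a minimal sketch of item 6 for TensorFlow 1.14; this is the generic tf.Session approach, and darkflow may expose its own option for the same setting:
import tensorflow as tf  # TF 1.x

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of all at once
sess = tf.Session(config=config)
# ... build and run the graph with this session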
Upvotes: 0
Reputation: 957
Tensorflow 2.0 alpha
The problem is that TensorFlow is greedy in allocating all the available VRAM, which causes issues for some people.
For Tensorflow 2.0 alpha / nightly use this:
import tensorflow as tf
tf.config.gpu.set_per_process_memory_fraction(0.4)
Source: https://www.tensorflow.org/alpha/guide/using_gpu
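Note that in later stable TF 2.x releases this alpha API was reorganized; as far as I know, the closest equivalent for on-demand allocation is set_memory_growth:
import tensorflow as tf

# Enable on-demand memory growth for every visible GPU (TF 2.x stable).
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)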
Upvotes: 5
Reputation: 1111
I experienced this memory error on Ubuntu 18.10. When I changed my monitor's resolution from 4K to Full HD (1920x1080), the available memory became 438 MB and the neural network training started. I was really surprised by this behavior.
By the way, I have an Nvidia 1080 with 8 GB of memory, and I still don't know why only ~400 MB was reported as available.
Upvotes: 0
Reputation: 1825
In case it's still relevant for someone, I encountered this issue when trying to run Keras/Tensorflow for the second time, after a first run was aborted. It seems the GPU memory is still allocated, and therefore cannot be allocated again. It was solved by manually ending all python processes that use the GPU, or alternatively, closing the existing terminal and running again in a new terminal window.
Upvotes: 57
Reputation: 4467
By default, tensorflow tries to allocate a fraction per_process_gpu_memory_fraction of the GPU memory for its process, to avoid costly memory management (see the GPUOptions comments).
This can fail and raise the CUDA_ERROR_OUT_OF_MEMORY warning.
I do not know what the fallback is in this case (either using CPU ops or allow_growth=True).
This can happen if another process is using the GPU at the moment (for instance, if you launch two processes running tensorflow).
The default behavior takes ~95% of the memory (see this answer).
When you use allow_growth = True, the GPU memory is not preallocated and will be able to grow as you need it. This leads to smaller memory usage (since the default option is to use the whole memory), but it can decrease performance if not used properly, as it requires more complex handling of the memory (which is not the most efficient part of CPU/GPU interactions).
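For completeness, a minimal TF 1.x sketch of the two behaviors described above (the 0.4 fraction is just an illustrative value):
import tensorflow as tf  # TF 1.x

# Preallocate only a fixed fraction of the GPU memory ...
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4)

# ... or start small and let allocations grow on demand:
# gpu_options = tf.GPUOptions(allow_growth=True)

sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))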
Upvotes: 24