Reputation: 1269
I successfully trained the network but got this error during validation:
RuntimeError: CUDA error: out of memory
Upvotes: 116
Views: 432658
Reputation: 40247
Please check whether other programs are holding your GPU memory. In the case of Jupyter Notebook, you can go to the Running tab, shut down all other notebooks, and try running yours again.
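A quick way to see how much of the GPU the current process itself holds (anything beyond that is held by other programs or notebooks); a small sketch using PyTorch's built-in counters:

import torch

# Memory held by this process only; compare against the totals shown by nvidia-smi.
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")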
Upvotes: 0
Reputation: 11
Do not call model.zero_grad() during inference or validation, as this will allocate a lot of memory.
Upvotes: 0
Reputation: 301
Not using the refiner and reducing num_inference_steps (from 25 to 15) worked for me.
You have to adjust these parameters to find out what your GPU can handle. I'm also listing my library versions in case it helps someone. For my NVIDIA GeForce RTX 3060 Laptop GPU, the following worked:
Python: 3.9.0
PyTorch: 2.2.0+cu118
CUDA: 11.8
Diffusers: 0.26.3
Transformers: 4.38.1
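For context, this is roughly what that looks like with the Diffusers text-to-image API; a minimal sketch assuming the SDXL base pipeline without the refiner (the model id and prompt are just examples):

import torch
from diffusers import StableDiffusionXLPipeline

# Load only the base model (no refiner) in half precision to keep VRAM usage down.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")  # pipe.enable_model_cpu_offload() is another option if VRAM is tight

# 15 denoising steps instead of 25, as suggested above.
image = pipe("a photo of an astronaut riding a horse",
             num_inference_steps=15).images[0]
image.save("out.png")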
Upvotes: 0
Reputation: 6658
You can try something like this around your training loop:

model = UNet(n_channels, n_classes)
model.use_checkpointing()  # enable activation checkpointing once, before training

for i in range(epochs):
    torch.cuda.empty_cache()  # release cached, unused blocks back to the driver
    your_train_function()

where your model definition looks like the block below:
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# DoubleConv, Down, Up and OutConv are the usual U-Net building blocks (not shown here).
class UNet(nn.Module):
    def __init__(self, n_channels, n_classes, bilinear=False):
        super().__init__()
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.bilinear = bilinear
        self.checkpointing = False

        self.inc = DoubleConv(n_channels, 64)
        self.down1 = Down(64, 128)
        self.down2 = Down(128, 256)
        self.down3 = Down(256, 512)
        factor = 2 if bilinear else 1
        self.down4 = Down(512, 1024 // factor)
        self.up1 = Up(1024, 512 // factor, bilinear)
        self.up2 = Up(512, 256 // factor, bilinear)
        self.up3 = Up(256, 128 // factor, bilinear)
        self.up4 = Up(128, 64, bilinear)
        self.outc = OutConv(64, n_classes)

    def _block(self, module, *inputs):
        # With checkpointing enabled, this block's activations are not stored;
        # they are recomputed during backward, trading extra compute for less memory.
        if self.checkpointing and self.training:
            return checkpoint(module, *inputs, use_reentrant=False)
        return module(*inputs)

    def forward(self, x):
        x1 = self._block(self.inc, x)
        x2 = self._block(self.down1, x1)
        x3 = self._block(self.down2, x2)
        x4 = self._block(self.down3, x3)
        x5 = self._block(self.down4, x4)
        x = self._block(self.up1, x5, x4)
        x = self._block(self.up2, x, x3)
        x = self._block(self.up3, x, x2)
        x = self._block(self.up4, x, x1)
        logits = self._block(self.outc, x)
        return logits

    def use_checkpointing(self):
        # Turn on activation checkpointing for all blocks in forward().
        self.checkpointing = True
More details are in the following link.
Upvotes: 0
Reputation: 212
I had this same error: RuntimeError: CUDA error: out of memory
I was able to resolve it on a machine with 4 GPUs by first running nvidia-smi to learn that GPU 1 was already at full capacity from another user, which caused the error because my script also tried to use that GPU. I then ran
export CUDA_VISIBLE_DEVICES=2,3,4
on the CLI. My script now runs by looking only at GPUs 2, 3 and 4, ignoring GPU 1.
In my case, my code doesn't actually need a GPU at all but was trying to use one, so I set
export CUDA_VISIBLE_DEVICES=""
and now it runs on the CPU without attempting to use the GPU.
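The same restriction can be applied from inside a Python script, as long as the variable is set before CUDA is initialized; a minimal sketch (the device indices are just examples):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"  # must be set before the first CUDA call

import torch
print(torch.cuda.device_count())      # 2 -- only the listed GPUs are visible
print(torch.cuda.get_device_name(0))  # physical GPU 2 now appears as cuda:0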
Upvotes: 1
Reputation: 569
Find out which other processes are also using the GPU and free up that memory.
Find the PID of the Python process by running:
nvidia-smi
and kill it using:
sudo kill -9 <pid>
Upvotes: 0
Reputation: 1387
If you're running Keras/TF in Jupyter on a local server and another notebook is open which was accessing the GPU, you can also get this error. Just halt and close the other notebook(s). This can occur even if the other notebook isn't actively running anything.
This is distinct from PyTorch OOM errors, which typically refer to PyTorch's allocation of GPU RAM and are of the form
OutOfMemoryError: CUDA out of memory. Tried to allocate 734.00 MiB (GPU 0; 7.79 GiB total capacity; 5.20 GiB already allocated; 139.94 MiB free; 6.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Because PyTorch manages a subset of GPU RAM for a given job, it can sometimes raise an OOM error even though there's sufficient free RAM on the GPU overall (just not enough in Torch's own allocation).
These errors can be a bit obscure to troubleshoot, but a few techniques help. One is to cap the maximum split size used by PyTorch's caching allocator, which reduces fragmentation:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"
You can also monitor overall GPU RAM with watch nvidia-smi:
Every 2.0s: nvidia-smi                       numbaCruncha123: Wed May 31 11:30:57 2023

Wed May 31 11:30:57 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03    Driver Version: 510.108.03    CUDA Version: 11.6   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:26:00.0 Off |                  N/A |
| 37%   33C    P2    34W / 175W |   7915MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2905      C   ...user/z_Venv/NC/bin/python     1641MiB |
|    0   N/A  N/A     31511      C   ...user/z_Venv/NC/bin/python     6271MiB |
+-----------------------------------------------------------------------------+
This will tell you what's using RAM across the entire GPU.
Note: if you've got a notebook running but don't see anything here, it's possible you're running on the CPU.
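For the PyTorch-side breakdown of those numbers (how much of the reserved pool is actually allocated), you can also print the caching allocator's summary from inside your script; a small sketch:

import torch

if torch.cuda.is_available():
    # Reports allocated vs. reserved memory and fragmentation stats for the device.
    print(torch.cuda.memory_summary(device=0, abbreviated=True))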
Upvotes: 1
Reputation: 1
I faced the same issue with my computer. All you have to do is customize your configuration file to match your machine's specifications. It turned out my setup could only handle image sizes below 600 x 600, and once I adjusted that in the configuration file, the program ran smoothly.
Upvotes: -3
Reputation: 151
In my experience, this is not a typical CUDA OOM Error caused by PyTorch trying to allocate more memory on the GPU than you currently have.
The giveaway is the distinct lack of the following text in the error message.
Tried to allocate xxx GiB (GPU Y; XXX GiB total capacity; yyy MiB already allocated; zzz GiB free; aaa MiB reserved in total by PyTorch)
In my experience, this is an Nvidia driver issue. A reboot has always solved the issue for me, but there are times when a reboot is not possible.
One alternative to rebooting is to kill all Nvidia processes and reload the drivers manually. I always refer to the unaccepted answer of this question written by Comzyh when performing the driver cycle. Hope this helps anyone trapped in this situation.
Upvotes: 1
Reputation: 486
Not sure if this'll help you or not, but this is what solved the issue for me:
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
Nothing else in this thread helped.
Upvotes: 1
Reputation: 630
The error occurs because you ran out of memory on your GPU.
One way to solve it is to reduce the batch size until your code runs without this error.
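With a typical PyTorch DataLoader this is a one-argument change; a minimal sketch (train_dataset and the sizes are placeholders):

from torch.utils.data import DataLoader

# If batch_size=64 runs out of memory, keep halving it until training fits on the GPU.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)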
Upvotes: 52
Reputation: 71
If you are getting this error in Google Colab, use this code:
import torch
torch.cuda.empty_cache()
Upvotes: 6
Reputation: 10051
Problem solved by the following code, which restricts the GPUs visible to the process (it must run before CUDA is initialized):
import os
os.environ['CUDA_VISIBLE_DEVICES']='2, 3'
Upvotes: 0
Reputation: 382
I am a PyTorch user. In my case, the cause of this error message was actually not GPU memory but a version mismatch between PyTorch and CUDA.
Check whether the cause really is your GPU memory with the code below:
import torch
foo = torch.tensor([1, 2, 3])
foo = foo.to('cuda')
If this minimal snippet still raises the error, it is better to reinstall PyTorch to match your CUDA version. (In my case, this solved the problem.) PyTorch install link
A similar situation can also occur with TensorFlow/Keras.
Upvotes: 10
Reputation: 724
I had the same issue and this code worked for me :
import gc
gc.collect()
torch.cuda.empty_cache()
Upvotes: 41
Reputation: 842
If someone arrives here because of fast.ai, the batch size of a loader such as ImageDataLoaders can be controlled via bs=N, where N is the batch size.
My dedicated GPU is limited to 2 GB of memory; using bs=8 in the following example worked in my situation:
from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(244), num_workers=0, bs=8)
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
Upvotes: 0
Reputation: 1000
The best way is to find the process that is using GPU memory and kill it:
Find the PID of the Python process with:
nvidia-smi
then copy the PID and kill it with:
sudo kill -9 <pid>
Upvotes: 48
Reputation: 1189
1. When you only perform validation, not training, you don't need to calculate gradients for the forward and backward passes. In that situation, your code can be placed under a torch.no_grad() context:

with torch.no_grad():
    ...
    net = Net()
    pred_for_validation = net(input)
    ...

Code inside this block doesn't build the autograd graph, so it uses far less GPU memory.

2. If you use the += operator in your code, it can keep accumulating the gradient graph. In that case, you need to use float() as described on the following page:
https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory
Even though the docs suggest float(), in my case item() also worked:

entire_loss = 0.0
for i in range(100):
    one_loss = loss_function(prediction, label)
    entire_loss += one_loss.item()

3. If you use a for loop in your training code, intermediate tensors can stay alive until the entire loop ends. In that case, you can explicitly delete them after calling optimizer.step():

for one_epoch in range(100):
    ...
    optimizer.step()
    del intermediate_variable1, intermediate_variable2, ...
Upvotes: 44
Reputation: 1927
It might happen for a number of reasons. One thing worth checking is the biggest_batch_first description for the BucketIterator in AllenNLP: it controls whether the largest, most memory-hungry batch is processed first. In addition, I would recommend having a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html
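For reference, in older AllenNLP (0.x) releases the option was passed to the iterator roughly like this; a sketch from memory, so treat the exact signature as an assumption:

from allennlp.data.iterators import BucketIterator

# Sort instances by length and put the largest batch first, so an out-of-memory
# error shows up immediately instead of partway through training.
iterator = BucketIterator(
    sorting_keys=[("tokens", "num_tokens")],
    batch_size=32,
    biggest_batch_first=True,
)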
Upvotes: 10