Reputation: 2575
I think it's a pretty common message for PyTorch users with low GPU memory:
RuntimeError: CUDA out of memory. Tried to allocate X MiB (GPU X; X GiB total capacity; X GiB already allocated; X MiB free; X cached)
I tried to process an image by loading each layer to GPU and then loading it back:
for m in self.children():
    m.cuda()
    x = m(x)
    m.cpu()
    torch.cuda.empty_cache()
But it doesn't seem to be very effective. Are there any tips and tricks for training large deep learning models while using little GPU memory?
Upvotes: 186
Views: 766677
Reputation: 12827
When training large deep learning models with little GPU memory, there are two main ways (apart from the ones discussed in other answers) to avoid the CUDA out of memory error: automatic mixed precision (AMP) and gradient accumulation.
I ran a small experiment fine-tuning ResNet101 (layer4 + fc layers) with AMP for 5 epochs on an NVIDIA GeForce RTX 2060 SUPER (Turing); these are the results. For larger workloads, you'll see bigger benefits. By default, PyTorch trains in FP32. AMP uses less memory and speeds up training while maintaining accuracy.
It requires only minor changes to your training loop; check the PyTorch docs.
import time
import torch

# `device` is assumed to be defined elsewhere, e.g. device = torch.device("cuda")
def train(n_epochs, loaders, model, optimizer, criterion, use_amp=False):
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  #1 initialize gradient scaler
    for epoch in range(1, n_epochs+1):  # epoch
        train_loss, valid_loss = 0.0, 0.0
        torch.cuda.synchronize()
        start_time = time.time()
        model.train()  # set model to training mode
        for batch_idx, (data, target) in enumerate(loaders['train']):  # training iteration
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()  # zero the accumulated gradients
            with torch.cuda.amp.autocast(enabled=use_amp):  #2 use AMP context
                outputs = model(data)  # forward pass
                loss = criterion(outputs, target)
            scaler.scale(loss).backward()  #3 call backward pass on scaled loss
            scaler.step(optimizer)  #4 unscale gradients, do weight update if not Infs or NaNs
            scaler.update()  #5 update scale factor for next iteration
            train_loss += ((1 / (batch_idx + 1)) * (loss.item() - train_loss))  # running mean
        model.eval()  # set model to evaluation mode
        for batch_idx, (data, target) in enumerate(loaders['valid']):  # validation iteration
            data, target = data.to(device), target.to(device)
            with torch.no_grad():
                with torch.cuda.amp.autocast(enabled=use_amp):  # AMP
                    outputs = model(data)
                    loss = criterion(outputs, target)
            valid_loss += ((1 / (batch_idx + 1)) * (loss.item() - valid_loss))  # running mean
        torch.cuda.synchronize()
        end_time = time.time()
        total_time = round((end_time - start_time)/60, 2)
        print(f'Epoch: {epoch} \tTraining Loss: {train_loss:.3f} \tValidation Loss: {valid_loss:.3f} \tTime: {total_time}min')
    return model
The second way is gradient accumulation. For example, with a batch size of 1, if you call optimizer.step() and optimizer.zero_grad() only on alternate iterations, your effective batch size becomes 2. Read more on the PyTorch forums.
# some code
# Initialize dataset with batch size 10
opt.zero_grad()
for i, (input, target) in enumerate(dataset):
    pred = net(input)
    loss = crit(pred, target)
    # one graph is created here
    loss.backward()
    # graph is cleared here
    if (i+1)%10 == 0:
        # every 10 iterations of batches of size 10
        opt.step()
        opt.zero_grad()
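Here is a minimal sketch of the same idea that runs on CPU, assuming a toy linear model and random data (the names accum_steps, full_grad and accum_grad are mine, for illustration only). When the loss is a mean over the batch, dividing each micro-batch loss by the number of accumulated steps should make the accumulated gradient match the gradient of one large batch:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()  # mean reduction
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data = torch.randn(8, 4)
target = torch.randn(8, 1)

# Reference: gradient of a single batch of 8 samples.
optimizer.zero_grad()
criterion(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2 samples each.
accum_steps = 4
optimizer.zero_grad()
micro_batches = zip(data.chunk(accum_steps), target.chunk(accum_steps))
for i, (x, y) in enumerate(micro_batches):
    loss = criterion(model(x), y) / accum_steps  # scale so the gradients average
    loss.backward()                              # gradients add up in .grad
    if (i + 1) % accum_steps == 0:
        accum_grad = model.weight.grad.clone()   # kept only for the comparison
        optimizer.step()
        optimizer.zero_grad()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

Only one micro-batch of activations is alive at a time, which is where the memory saving comes from; the price is more forward/backward passes per weight update.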
Upvotes: 2
Reputation: 181
Try not using the refiner.
None of the solutions worked for me besides reducing num_inference_steps to 15 and generating without the refiner.
You have to adjust the parameters and find out what fits for you.
Upvotes: 0
Reputation: 494
If you get the out of memory error right off the bat, before epoch 1 starts,
torch.cuda.empty_cache()
gc.collect()
coupled with lowering the batch_size may work in some cases, as noted by previous answers. In my case it was not enough. I did 2 more things:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1024"
Here you can adjust 1024 to a desired size.
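As far as I know, the allocator reads this variable when CUDA memory is first initialised, so it is safest to set it before the first CUDA call. Setting it at the very top of the script, before importing torch, is a simple way to guarantee that. A sketch (the value 512 is only an example, not a recommendation):

```python
import os

# Set the allocator option before torch touches CUDA. setdefault keeps any
# value that was already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

# import torch  # import torch only after the variable is set

alloc_conf = os.environ["PYTORCH_CUDA_ALLOC_CONF"]
print(alloc_conf)
```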
I adjusted the size of the images I was introducing to the network, in the dataset class, particularly in the __getitem__() method:
def __getitem__(self, i_dex, resize_=(320,480)):
    transforms_ = transforms.Compose([
        transforms.PILToTensor(),
        transforms.ConvertImageDtype(torch.float32),
    ])
    im_ = Image_.open(self.data_paths[i_dex])
    if im_.mode !='RGB':
        im_ = im_.convert('RGB')
    im_ = im_.resize(resize_)
    return transforms_(im_), labels[i_dex]
and reduced the batch_size from 40 to 20. Before resizing, the maximum batch_size I was able to run was 4. This is very important for contrastive learning models like SimCLR, where the batch size must be larger (256 or more) so that the model learns from multiple contrastive augmentation image pairs.
Edit: Repeating the process above several times, I was eventually able to train the model with a batch size of 400.
To monitor GPU resources you can use something like glances. This makes things easier while adjusting parameters.
Upvotes: 1
Reputation: 69
Though not relevant to the original question, I faced the same issue while using https://github.com/oobabooga/text-generation-webui, and Bing returns this particular SO page as the top result. I resolved this by increasing the GPU memory.
Upvotes: 3
Reputation: 36584
Send the batches to CUDA iteratively, and make small batch sizes. Don't send all your data to CUDA at once in the beginning. Rather, do it as follows:
for e in range(epochs):
    for images, labels in train_loader:
        if torch.cuda.is_available():
            images, labels = images.cuda(), labels.cuda()
        # ... rest of the training step
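A self-contained sketch of that pattern, assuming a toy in-memory dataset (the tensor shapes and the batch_shapes list are made up for illustration). The full dataset stays in CPU memory and only the current batch is moved to the device:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy data; it is never moved to the GPU as a whole.
images = torch.randn(100, 3, 8, 8)
labels = torch.randint(0, 10, (100,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=16)

batch_shapes = []
for batch_images, batch_labels in train_loader:
    # Only the current batch lives on the device.
    batch_images = batch_images.to(device)
    batch_labels = batch_labels.to(device)
    batch_shapes.append(tuple(batch_images.shape))

print(len(batch_shapes), batch_shapes[0], batch_shapes[-1])
```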
Upvotes: 29
Reputation: 443
I see no one has advised waiting after garbage collection. If nothing else helps, you can try waiting until enough memory has actually been freed. Try this:
import gc
import time

import torch
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

def clear_gpu_memory():
    # First delete your own references to tensors you no longer need, e.g. `del model`.
    gc.collect()              # collect unreachable Python objects that still hold tensors
    torch.cuda.empty_cache()  # then release PyTorch's cached blocks

def wait_until_enough_gpu_memory(min_memory_available, max_retries=10, sleep_time=5):
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(torch.cuda.current_device())
    for _ in range(max_retries):
        info = nvmlDeviceGetMemoryInfo(handle)
        if info.free >= min_memory_available:
            break
        print(f"Waiting for {min_memory_available} bytes of free GPU memory. Retrying in {sleep_time} seconds...")
        time.sleep(sleep_time)
    else:
        # the loop finished without a break: memory never became available
        raise RuntimeError(f"Failed to acquire {min_memory_available} bytes of free GPU memory after {max_retries} retries.")

# Usage example
min_memory_available = 2 * 1024 * 1024 * 1024  # 2GB
clear_gpu_memory()
wait_until_enough_gpu_memory(min_memory_available)
Upvotes: 2
Reputation: 35
This might seem too simplistic, but it worked for me: I just closed VS Code, opened it again, then restarted and ran all the cells.
Upvotes: 1
Reputation: 11
If you are working with images, just reduce the input image shape. For example, if you are using 512x512, try 256x256. It worked for me!
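To put a rough number on it: halving each side of the image cuts the pixel count by 4, and the activations of convolutional layers usually shrink by about the same factor. A back-of-the-envelope sketch for the input batch alone (the function name and the batch size of 32 are my own choices for illustration):

```python
def input_batch_bytes(batch_size, height, width, channels=3, bytes_per_value=4):
    """Bytes needed for one float32 image batch (inputs only, not activations)."""
    return batch_size * channels * height * width * bytes_per_value

large = input_batch_bytes(32, 512, 512)
small = input_batch_bytes(32, 256, 256)
print(large / 2**20, small / 2**20, large / small)  # 96.0 24.0 4.0 (MiB, MiB, ratio)
```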
Upvotes: 1
Reputation: 109
If you are done training and just want to test with an image, make sure to add with torch.no_grad() and m.eval() at the beginning:
with torch.no_grad():
    for m in self.children():
        m.cuda()
        m.eval()
        x = m(x)
        m.cpu()
        torch.cuda.empty_cache()
This may seem obvious, but it worked in my case. I was trying to use BERT to transform sentences into an embedding representation. As BERT is a pre-trained model, I didn't need to save all the gradients, and they were consuming all of the GPU's memory.
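A tiny CPU-only sketch of what torch.no_grad() changes, assuming a toy linear layer in place of a real model (the variable names are mine). Inside the block no autograd graph is recorded, so the intermediate results needed for backward are not kept:

```python
import torch

model = torch.nn.Linear(16, 4)
x = torch.randn(2, 16)

model.eval()               # switches layers such as dropout to inference behaviour
with torch.no_grad():      # no autograd graph is recorded inside this block
    out_inference = model(x)

out_training = model(x)    # outside no_grad the graph is kept for backward

print(out_inference.requires_grad, out_training.requires_grad)  # False True
```

Note that m.eval() on its own does not stop gradient tracking; it is the no_grad block that saves the memory.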
Upvotes: 5
Reputation: 57
I faced the same problem and resolved it by downgrading PyTorch from 1.10.1 to 1.8.1 with CUDA 11.3. In my case, I am using an RTX 3060 GPU, which works only with CUDA 11.3 or above, and when I installed CUDA 11.3, it came with PyTorch 1.10.1. So I downgraded PyTorch, and now it is working fine.
$ pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
You can also try reducing the training batch size.
Upvotes: 0
Reputation: 1688
Although
import torch
torch.cuda.empty_cache()
provides a good alternative for clearing the occupied CUDA memory, and we can also manually clear unused variables with
import gc
del variables
gc.collect()
the error might still appear again after using these commands, because PyTorch doesn't actually clear the memory; it only clears the reference to the memory occupied by the variables. So reducing the batch_size after restarting the kernel and finding the optimum batch_size is the best possible option (but sometimes not a very feasible one).
Another way to get deeper insight into the allocation of memory on the GPU is to use:
torch.cuda.memory_summary(device=None, abbreviated=False)
where both arguments are optional. This gives a readable summary of memory allocation and lets you figure out why CUDA is running out of memory, so you can restart the kernel and avoid the error happening again (just like I did in my case).
Passing the data iteratively might help, but changing the size of the layers of your network or breaking them down would also prove effective (as sometimes the model itself also occupies significant memory, for example while doing transfer learning).
Upvotes: 130
Reputation: 1
I met the same error; my GPU is a GTX 1650 with 4 GB of video memory, and I have 16 GB of RAM. It worked for me when I reduced the batch_size to 3. Hope this helps.
Upvotes: 0
Reputation: 109
As long as you don't go above a batch size of 32, you will be fine. Just remember to refresh or restart the runtime, or else you will encounter the same error even if you reduce the batch size. I set my batch size to 16; it reduces zero gradients from occurring during my training and the model matches the true function much better. Rather than using a batch size of 4 or 8 which causes the training loss to fluctuate than
Upvotes: -1
Reputation: 1530
There is now a pretty awesome library which makes this very simple: https://github.com/rentruewang/koila
pip install koila
Then, in your code, simply wrap the input with lazy:
from koila import lazy
input = lazy(input, batch=0)
Upvotes: -1
Reputation: 1185
Most things are covered, still will add a little.
If torch gives an error such as "Tried to allocate 2 MiB", the message is misleading. Actually, CUDA has run out of the total memory required to train the model. You can reduce the batch size. If even a batch size of 1 does not work (this happens when you train NLP models with massive sequences), try passing less data; this will help you confirm that your GPU does not have enough memory to train the model.
Also, garbage collection and cache clearing have to be done again if you want to re-train the model.
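To see why the small number in the message is not the real story, it can help to read the other figures in it. A rough sketch that parses a message in the format quoted in another answer on this page (the exact wording varies between PyTorch versions, so the regular expressions and the helper name parse_oom_message are assumptions, not an official API):

```python
import re

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}

def parse_oom_message(message):
    """Return the sizes (in MiB) reported in a CUDA out-of-memory message."""
    sizes = {}
    requested = re.search(r"Tried to allocate (\d+(?:\.\d+)?) (MiB|GiB)", message)
    if requested:
        sizes["requested"] = float(requested.group(1)) * UNIT_TO_MIB[requested.group(2)]
    pattern = r"(\d+(?:\.\d+)?) (MiB|GiB) (total capacity|already allocated|free|reserved)"
    for value, unit, label in re.findall(pattern, message):
        sizes[label] = float(value) * UNIT_TO_MIB[unit]
    return sizes

message = ("CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; "
           "4.29 GiB already allocated; 10.12 MiB free; 4.46 GiB reserved in total by PyTorch)")
sizes = parse_oom_message(message)

# The request is tiny, but even less than that is free on the card.
print(sizes["requested"], sizes["free"])
# My reading (an assumption): capacity not reserved by this process and not free
# is held elsewhere, e.g. by other processes or the CUDA context.
unaccounted = sizes["total capacity"] - sizes["reserved"] - sizes["free"]
print(round(unaccounted, 2))
```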
Upvotes: 9
Reputation: 639
I would recommend using mixed precision training with PyTorch. It can make training way faster and consume less memory.
Take a look at https://spell.ml/blog/mixed-precision-training-with-pytorch-Xuk7YBEAACAASJam.
Upvotes: 0
Reputation: 1
Although this seems bizarre, I found that many sessions keep running in the background on Colab even if we factory reset the runtime or close the tab. I fixed this by clicking "Runtime" in the menu and then selecting "Manage Sessions". I terminated all the unwanted sessions and was good to go.
Upvotes: 0
Reputation: 51
Follow these steps:
In my case, the same error was raised when I was training on the Common Voice dataset in Kaggle kernels. I dealt with it by reducing the training dataset to 20000, the batch size to 16, and the model parameters to 112K.
Upvotes: 5
Reputation: 506
There are ways to avoid it, but it certainly depends on your GPU memory size:
for features, labels in batch:
    features, labels = features.to(device), labels.to(device)
Use the .detach() method to remove tensors from the GPU which are not needed.
If all of the above are used properly, the PyTorch library is already highly optimized and efficient.
Upvotes: 2
Reputation: 56
I had the same error but fixed it by resizing my images from ~600 to 100 using these lines:
import torchvision.transforms as transforms
transform = transforms.Compose([
    transforms.Resize((100, 100)),
    transforms.ToTensor()
])
Upvotes: 0
Reputation: 151
Try not to drag your grads too far.
I got the same error when I tried to sum up the loss over all batches:
loss = self.criterion(pred, label)
total_loss += loss
Then I used loss.item() instead of loss, which requires grads, and that solved the problem:
loss = self.criterion(pred, label)
total_loss += loss.item()
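A small CPU-only sketch of the difference, assuming a toy model and random data (names are illustrative). Adding the loss tensor keeps a reference to every batch's graph, while loss.item() gives a plain Python float:

```python
import torch

model = torch.nn.Linear(4, 1)
criterion = torch.nn.MSELoss()

total_with_graph = 0.0
total_as_float = 0.0
for _ in range(3):
    pred = model(torch.randn(8, 4))
    loss = criterion(pred, torch.randn(8, 1))
    total_with_graph += loss       # becomes a tensor that references every batch's graph
    total_as_float += loss.item()  # plain Python float, the graph can be freed

print(type(total_with_graph).__name__, total_with_graph.requires_grad)  # Tensor True
print(type(total_as_float).__name__)                                    # float
```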
The solution below is credited to yuval reina in the Kaggle question:
This error is related to the GPU memory and not the general memory => @cjinny comment might not work.
Do you use TensorFlow/Keras or PyTorch?
Try using a smaller batch size.
If you use Keras, try to decrease some of the hidden layer sizes.
If you use PyTorch:
    do you keep all the training data on the GPU all the time?
    make sure you don't drag the grads too far
    check the sizes of your hidden layers
Upvotes: 14
Reputation: 261
The best way would be lowering the batch size. Usually it works. Otherwise try this:
import gc
del variable  # delete unnecessary variables
gc.collect()
Upvotes: -2
Reputation: 944
Implementation:
Feed the images to the GPU batch by batch.
Use a small batch size during training or inference.
Resize the input images to a smaller size.
Technically:
a. Compact your network with techniques like model compression, network pruning and quantization.
b. Use a more compact network structure directly, like MobileNetV1/2/3.
c. Neural architecture search (NAS).
Upvotes: 0
Reputation: 949
Just reduce the batch size, and it will work. While I was training, it gave the following error:
CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 4.29 GiB already allocated; 10.12 MiB free; 4.46 GiB reserved in total by PyTorch)
And I was using batch size of 32. So I just changed it to 15 and it worked for me.
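If you don't want to find the number by hand, the search can be automated. A rough sketch, assuming your training step can be wrapped in a function that takes a batch size (run_step and fake_step are hypothetical names; fake_step only simulates the error so the example runs without a GPU). I believe recent PyTorch versions raise torch.cuda.OutOfMemoryError, a subclass of RuntimeError, so catching RuntimeError and checking the message should cover both old and new versions:

```python
def find_working_batch_size(run_step, start_batch_size, min_batch_size=1):
    """Halve the batch size until run_step(batch_size) stops raising an OOM error."""
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated error: do not hide it
            batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit in memory")


def fake_step(batch_size):
    # Hypothetical stand-in for one forward/backward pass; pretends that
    # anything above 15 samples exhausts GPU memory.
    if batch_size > 15:
        raise RuntimeError("CUDA out of memory. Tried to allocate 20.00 MiB")


print(find_working_batch_size(fake_step, 32))  # 32 fails, 16 fails, 8 fits -> 8
```

In a real loop you would probably also call torch.cuda.empty_cache() between attempts, and as other answers note, a kernel restart may still be needed.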
Upvotes: 66