prestonphilly

Reputation: 21

PyTorch cuda out of memory issue

I keep getting the following error when training a model in PyTorch. I have even added the following at the start of my code, but I still get the error. I am running this in a Jupyter Notebook.

import gc
import torch

gc.collect()                # release unreferenced Python objects
torch.cuda.empty_cache()    # return cached, unused GPU memory to the driver

What can I do to fix this?


OutOfMemoryError                          Traceback (most recent call last)
<ipython-input-6-2b42038d1b55> in <module>
     29 
     30         loss_mask = torch.mean((predicted_img - input_tensor) ** 2 * mask / mask_ratio)
---> 31         loss.backward()
     32 
     33         optim.step()

~/anaconda3/envs/ssenv/lib/python3.8/site-packages/torch/_tensor.py in backward(self, gradient, retain_graph, create_graph, inputs)
    490                 inputs=inputs,
    491             )
--> 492         torch.autograd.backward(
    493             self, gradient, retain_graph, create_graph, inputs=inputs
    494         )

~/anaconda3/envs/ssenv/lib/python3.8/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    249     # some Python versions print out the first line of a multi-line function
    250     # calls in the traceback and some print out the last line
--> 251     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252         tensors,
    253         grad_tensors_,

OutOfMemoryError: CUDA out of memory. Tried to allocate 146.00 MiB. GPU 0 has a total capacty of 9.62 GiB of which 100.94 MiB is free. Process 1485727 has 200.00 MiB memory in use. 

Including non-PyTorch memory, this process has 9.49 GiB memory in use. Of the allocated memory 8.96 GiB is allocated by PyTorch, and 385.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Upvotes: 0

Views: 1570

Answers (2)

trsvchn

Reputation: 8981

If you're running your training code inside a Jupyter environment, try restarting the kernel between runs; this will free the GPU memory. Otherwise, try reducing the batch size or using gradient accumulation (a sketch of the latter is below).
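
A minimal sketch of gradient accumulation, assuming a standard training loop; model, criterion, optimizer, and data_loader are placeholders for your own objects, and accumulation_steps is an assumed value you would tune:

import torch

accumulation_steps = 4  # effective batch size = dataloader batch size * 4

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data_loader):
    inputs, targets = inputs.cuda(), targets.cuda()
    # Divide the loss so the accumulated gradient matches a full-batch update
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()
    # Step the optimizer only every `accumulation_steps` mini-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

This keeps the per-step GPU memory of a small batch while the weight updates behave roughly like a larger batch.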

Upvotes: 0

mrw

Reputation: 152

You have a few options you can try.

  1. Reduce the batch size - you can use gradient accumulation to help here too.
  2. Change to mixed precision training - turn the tensors from FP32 to FP16, for example (see the sketch after this list).
  3. If you have access to multiple GPUs, you can explore distributed training. Make sure to choose a strategy that reduces per-GPU memory rather than one that only speeds up batch processing.
  4. Reduce your model size.
  5. Depending on your data, you may be able to make the inputs smaller, e.g. reduce image resolution, crop the images, or use greyscale.
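
For option 2, a rough sketch using PyTorch's automatic mixed precision (torch.cuda.amp); model, criterion, optimizer, and data_loader are again placeholders for your own objects:

import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in data_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    # Run the forward pass and loss in reduced precision where it is safe
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets)
    # Scale the loss to avoid FP16 gradient underflow, then unscale and step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Activations and many intermediate tensors are then stored in FP16, which typically cuts memory use noticeably for conv- and transformer-style models.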

Upvotes: 1
