Reputation: 128
I’m encountering an issue with GPU memory allocation while training a GPT-2 model on a GPU with 24 GB of VRAM. Despite having a substantial amount of available memory, I’m receiving the following error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.68 GiB total capacity; 18.17 GiB already allocated; 64.62 MiB free; 18.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
Here are the specifications of my setup and the model training:
GPU: NVIDIA GPU with 24 GB VRAM
Model: GPT-2, approximately 3 GB in size, i.e. roughly 800 million parameters at 32 bits each
Training Data: 36,000 training examples with a vector length of 600
Training Configuration: 5 epochs, batch size of 16, and fp16 enabled
These are my calculations:
Model Size:
GPT-2 model: ~3 GB
Gradients:
Gradients are typically of the same size as the model’s parameters.
Batch Size and Training Examples:
Batch Size: 16
Training Examples: 36,000
Vector Length: 600
Memory Allocation per Batch:
Model: 3 GB (unchanged per batch)
Gradients: 3 GB (unchanged per batch)
Input Data: 16 x 600 (vector length) x 4 bytes (assuming each value is a 32-bit float) = 37.5 KB per batch
Output Data: 16 x 600 (vector length) x 4 bytes (assuming each value is a 32-bit float) = 37.5 KB per batch
Based on the above calculations, the memory allocation per batch for my scenario would be approximately:
Model: 3 GB
Gradients: 3 GB
Input and Output Data: 75 KB
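For reference, here is a minimal sketch of the arithmetic above (the 3 GB model and gradient figures are my own rough estimates; the sketch only covers the quantities listed and nothing else):
GB = 1024 ** 3

# My own rough estimates, not measured values
model_bytes = 3 * GB       # GPT-2 weights, ~800 million fp32 parameters
gradient_bytes = 3 * GB    # gradients, same size as the weights

batch_size = 16
vector_length = 600
bytes_per_value = 4        # assuming each value is a 32-bit float

input_bytes = batch_size * vector_length * bytes_per_value    # 38,400 B ≈ 37.5 KB
output_bytes = batch_size * vector_length * bytes_per_value   # 38,400 B ≈ 37.5 KB

total_bytes = model_bytes + gradient_bytes + input_bytes + output_bytes
print(f"Estimated per-batch footprint: {total_bytes / GB:.2f} GB")   # ~6.00 GB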
I would appreciate any insights or suggestions on how to resolve this issue. Thank you in advance for your assistance!
Upvotes: 2
Views: 4285
Reputation: 1985
Usually this issue is caused by processes that are still holding CUDA memory without having released it. If none of those processes needs to keep running, the most effective fix is to identify them and kill them.
From command line, run:
nvidia-smi
If it is not installed, you can install it with the following command:
sudo apt-get install -y nvidia-smi
It will print something like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:18:00.0 Off | 0 |
| N/A 32C P0 37W / 250W | 11480MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... On | 00000000:3B:00.0 Off | 0 |
| N/A 31C P0 33W / 250W | 10200MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCI... On | 00000000:86:00.0 Off | 0 |
| N/A 53C P0 41W / 250W | 10200MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:AF:00.0 Off | 0 |
| N/A 31C P0 35W / 250W | 10200MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 29142 C /usr/bin/python3 11477MiB |
| 1 N/A N/A 29142 C /usr/bin/python3 10197MiB |
| 2 N/A N/A 29142 C /usr/bin/python3 10197MiB |
| 3 N/A N/A 29142 C /usr/bin/python3 10197MiB |
+-----------------------------------------------------------------------------+
At the bottom of the output, you will find the processes that are using the GPU(s), together with their PIDs. Assuming you are using Linux, you can kill them with the following command, replacing ProcessPID
with the actual PID of your process (again, make sure the process has finished or no longer needs to run):
kill ProcessPID
If this does not work, try:
kill -9 ProcessPID
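If you would rather script this check than read the table, here is a minimal Python sketch that lists the compute processes through nvidia-smi's CSV query mode (this assumes nvidia-smi is on your PATH and that your driver supports the query flags, which recent versions do):
# List processes currently holding GPU memory via nvidia-smi's CSV query mode
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    pid, name, used = [field.strip() for field in line.split(",")]
    print(f"PID {pid} ({name}) is holding {used}")
Once you know which PIDs are stale, kill them as described above.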
Upvotes: 1