Reputation: 128
I’m encountering an issue with GPU memory allocation while training a GPT-2 model on a GPU with 24 GB of VRAM. Despite having a substantial amount of available memory, I’m receiving the following error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.68 GiB total capacity; 18.17 GiB already allocated; 64.62 MiB free; 18.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
Here are the specifications of my setup and the model training:
GPU: NVIDIA GPU with 24 GB VRAM
Model: GPT-2, approximately 3 GB in size, i.e. roughly 800 million parameters at 32 bits each
Training Data: 36,000 training examples with a vector length of 600
Training Configuration: 5 epochs, batch size of 16, and fp16 enabled
These are my calculations:
Model Size:
GPT-2 model: ~3 GB
Gradients:
Gradients are typically of the same size as the model’s parameters.
Batch Size and Training Examples:
Batch Size: 16
Training Examples: 36,000
Vector Length: 600
Memory Allocation per Batch:
Model: 3 GB (unchanged per batch)
Gradients: 3 GB (unchanged per batch)
Input Data: 16 x 600 (vector length) x 4 bytes (assuming each value is a 32-bit float) = 37.5 KB per batch
Output Data: 16 x 600 (vector length) x 4 bytes (assuming each value is a 32-bit float) = 37.5 KB per batch
Based on the above calculations, the memory allocation per batch for my scenario would be approximately:
Model: 3 GB
Gradients: 3 GB
Input and Output Data: 75 KB
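For reference, here is a minimal sketch of the arithmetic above (the 3 GB model and gradient figures are my own rough estimates; the sketch only covers the quantities listed and nothing else):
GB = 1024 ** 3

# My own rough estimates, not measured values
model_bytes = 3 * GB       # GPT-2 weights, ~800 million fp32 parameters
gradient_bytes = 3 * GB    # gradients, same size as the weights

batch_size = 16
vector_length = 600
bytes_per_value = 4        # assuming each value is a 32-bit float

input_bytes = batch_size * vector_length * bytes_per_value    # 38,400 B ≈ 37.5 KB
output_bytes = batch_size * vector_length * bytes_per_value   # 38,400 B ≈ 37.5 KB

total_bytes = model_bytes + gradient_bytes + input_bytes + output_bytes
print(f"Estimated per-batch footprint: {total_bytes / GB:.2f} GB")   # ~6.00 GB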
I would appreciate any insights or suggestions on how to resolve this issue. Thank you in advance for your assistance!
Upvotes: 2
Views: 4285
Reputation: 1985
Usually this issue is caused by processes that are still holding CUDA memory without having released it. If none of those processes needs to keep running, the most effective fix is to identify them and kill them.
From command line, run:
nvidia-smi
If it is not installed, you can install it with the following command:
sudo apt-get install -y nvidia-smi
It will print something like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:18:00.0 Off | 0 |
| N/A 32C P0 37W / 250W | 11480MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... On | 00000000:3B:00.0 Off | 0 |
| N/A 31C P0 33W / 250W | 10200MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCI... On | 00000000:86:00.0 Off | 0 |
| N/A 53C P0 41W / 250W | 10200MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:AF:00.0 Off | 0 |
| N/A 31C P0 35W / 250W | 10200MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 29142 C /usr/bin/python3 11477MiB |
| 1 N/A N/A 29142 C /usr/bin/python3 10197MiB |
| 2 N/A N/A 29142 C /usr/bin/python3 10197MiB |
| 3 N/A N/A 29142 C /usr/bin/python3 10197MiB |
+-----------------------------------------------------------------------------+
At the bottom of the output, you will find the processes that are using the GPU(s), together with their PIDs. Assuming you are using Linux, you can kill them with the following command, replacing ProcessPID
with the actual PID of your process (again, make sure the process has finished or no longer needs to run):
kill ProcessPID
If this does not work, try:
kill -9 ProcessPID
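If you would rather script this check than read the table, here is a minimal Python sketch that lists the compute processes through nvidia-smi's CSV query mode (this assumes nvidia-smi is on your PATH and that your driver supports the query flags, which recent versions do):
# List processes currently holding GPU memory via nvidia-smi's CSV query mode
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    pid, name, used = [field.strip() for field in line.split(",")]
    print(f"PID {pid} ({name}) is holding {used}")
Once you know which PIDs are stale, kill them as described above.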
Upvotes: 1