Reputation: 122112
Starting with zero usage:
>>> import gc
>>> import GPUtil
>>> import torch
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 0% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Then I create a big enough tensor and hog the memory:
>>> x = torch.rand(10000,300,200).cuda()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 26% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Then I tried several ways to see if the tensor disappears.
Attempt 1: Detach, send to CPU and overwrite the variable
No, doesn't work.
>>> x = x.detach().cpu()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 26% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Attempt 2: Delete the variable
No, this doesn't work either
>>> del x
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 26% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Attempt 3: Use the torch.cuda.empty_cache() function
It seems to work, but there are some lingering overheads...
>>> torch.cuda.empty_cache()
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 5% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Attempt 4: Maybe run the garbage collector.
No, 5% is still being hogged
>>> gc.collect()
0
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 5% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
Attempt 5: Try deleting torch altogether (as if that would work when del x didn't work -_- )
No, it doesn't...
>>> del torch
>>> GPUtil.showUtilization()
| ID | GPU | MEM |
------------------
| 0 | 0% | 5% |
| 1 | 0% | 0% |
| 2 | 0% | 0% |
| 3 | 0% | 0% |
And then I tried to check gc.get_objects() and it looks like there's still quite a lot of odd THCTensor stuff inside...
Any idea why the memory is still in use after clearing the cache?
Upvotes: 39
Views: 28891
Reputation: 783
From the PyTorch docs:
Memory management
PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. **However, the unused memory managed by the allocator will still show as if used in nvidia-smi.** You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and use memory_cached() and max_memory_cached() to monitor memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that those can be used by other GPU applications. However, the occupied GPU memory by tensors will not be freed so it can not increase the amount of GPU memory available for PyTorch.
I bolded a part mentioning nvidia-smi, which as far as I know is used by GPUtil.
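For illustration, here is a minimal sketch (not from the docs) of watching the two counters while allocating and freeing; note that in recent PyTorch versions memory_cached()/max_memory_cached() were renamed to memory_reserved()/max_memory_reserved():
import torch

def report(tag):
    # memory_allocated(): memory occupied by live tensors
    # memory_reserved(): memory held by the caching allocator (what nvidia-smi sees)
    print(tag,
          "allocated:", torch.cuda.memory_allocated() // 1024**2, "MiB |",
          "reserved:", torch.cuda.memory_reserved() // 1024**2, "MiB")

x = torch.rand(10000, 300, 200, device="cuda")
report("after alloc      ")   # both counters grow
del x
report("after del        ")   # allocated drops to 0, reserved stays
torch.cuda.empty_cache()
report("after empty_cache")   # reserved drops too; nvidia-smi keeps only the CUDA context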
Upvotes: 17
Reputation: 908
Thanks for sharing this! I ran into the same problem and used your example to debug. Basically, my findings are: neither torch.cuda.empty_cache() nor gc.collect() alone frees anything; del x followed by gc.collect() frees the memory allocated to tensors but leaves the cache; only del x followed by torch.cuda.empty_cache() actually returns the memory to the device (what remains in nvidia-smi afterwards is the CUDA context).
Here's some code to reproduce the experiment:
import gc
import torch

def _get_less_used_gpu():
    # Print allocator statistics per device and return the device with the
    # least memory currently allocated to tensors.
    from torch import cuda
    cur_allocated_mem = {}
    cur_cached_mem = {}
    max_allocated_mem = {}
    max_cached_mem = {}
    for i in range(cuda.device_count()):
        cur_allocated_mem[i] = cuda.memory_allocated(i)
        cur_cached_mem[i] = cuda.memory_reserved(i)
        max_allocated_mem[i] = cuda.max_memory_allocated(i)
        max_cached_mem[i] = cuda.max_memory_reserved(i)
    print(cur_allocated_mem)
    print(cur_cached_mem)
    print(max_allocated_mem)
    print(max_cached_mem)
    min_all = min(cur_allocated_mem, key=cur_allocated_mem.get)
    print(min_all)
    return min_all
x = torch.rand(10000,300,200, device=0)
# see memory usage
_get_less_used_gpu()
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: 3416MiB
# try delete with empty_cache()
torch.cuda.empty_cache()
_get_less_used_gpu()
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: 3416MiB
# try delete with gc.collect()
gc.collect()
_get_less_used_gpu()
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: 3416MiB
# try del + gc.collect()
del x
gc.collect()
_get_less_used_gpu()
>{0: **0**, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: 3416MiB
# try empty_cache() after deleting
torch.cuda.empty_cache()
_get_less_used_gpu()
>{0: 0, 1: 0, 2: 0, 3: 0}
>{0: **0**, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: **1126MiB**
# re-create obj and try del + empty_cache()
x = torch.rand(10000,300,200, device=0)
del x
torch.cuda.empty_cache()
_get_less_used_gpu()
>{0: **0**, 1: 0, 2: 0, 3: 0}
>{0: **0**, 1: 0, 2: 0, 3: 0}
>{0: 2400000000, 1: 0, 2: 0, 3: 0}
>{0: 2401239040, 1: 0, 2: 0, 3: 0}
> *nvidia-smi*: **1126MiB**
Nonetheless, this approach only applies when one knows exactly which variables are holding memory... which is not always the case when one trains deep learning models, I guess, especially when using third-party libraries. One way to hunt for such lingering references is sketched below.
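A sketch, assuming you just want to see which CUDA tensors are still reachable via gc.get_objects() (the same check the question mentions):
import gc
import torch

def list_live_cuda_tensors():
    # Report every CUDA tensor that is still reachable; these are the objects
    # that keep memory_allocated() above zero.
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                print(type(obj).__name__, tuple(obj.shape), obj.device)
        except Exception:
            pass  # some tracked objects raise on attribute access
Once the offending references are gone, gc.collect() followed by torch.cuda.empty_cache() should return the memory to the device, as in the experiment above.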
Upvotes: 11
Reputation: 7209
It looks like PyTorch's caching allocator reserves some fixed amount of memory even if there are no tensors, and this allocation is triggered by the first CUDA memory access (torch.cuda.empty_cache() deletes unused tensors from the cache, but the cache itself still uses some memory).
Even with a tiny 1-element tensor, after del and torch.cuda.empty_cache(), GPUtil.showUtilization(all=True) reports exactly the same amount of GPU memory used as for a huge tensor (and both torch.cuda.memory_cached() and torch.cuda.memory_allocated() return zero).
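A quick way to see that baseline (a sketch; the exact amount shown by nvidia-smi depends on the GPU, driver and PyTorch/CUDA versions):
import torch

x = torch.ones(1, device="cuda")   # the first CUDA access creates the CUDA context
del x
torch.cuda.empty_cache()

print(torch.cuda.memory_allocated())  # 0
print(torch.cuda.memory_reserved())   # 0
# nvidia-smi / GPUtil still report several hundred MiB in use on this device:
# that is the CUDA context itself, which PyTorch does not release.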
Upvotes: 19