Reputation: 389
I’m trying to understand what happens to both RAM and GPU memory when a tensor is sent to the GPU.
In the following code sample, I create two tensors - a large tensor arr = torch.ones((10000, 10000)) and a small tensor c = torch.ones(1). The tensor c is sent to the GPU inside the target function step, which is called by multiprocessing.Pool. In doing so, each child process uses 487 MB of GPU memory and RAM usage climbs to 5 GB. Note that the large tensor arr is created only once before calling Pool and is not passed as an argument to the target function. RAM usage does not explode when everything stays on the CPU.
I have the following questions on this example:
I’m sending torch.ones(1) to the GPU and yet it consumes 487 MB of GPU memory. Does CUDA allocate a minimum amount of memory on the GPU even if the underlying tensor is very small? GPU memory is not a problem for me; this is just to help me understand how the allocation works.
The real problem is RAM usage. Even though I am sending only a small tensor to the GPU, it appears as if everything in memory (including the large tensor arr) is copied into every child process (possibly to pinned memory). So when a tensor is sent to the GPU, which objects are copied to pinned memory? I’m missing something here, because it makes no sense to stage everything for transfer to the GPU when I’m only sending one particular object.
Thanks!
from multiprocessing import get_context
import time

import torch

dim = 10000
sleep_time = 2
npe = 4  # number of parallel executions

# cuda
if torch.cuda.is_available():
    dev = 'cuda:0'
else:
    dev = "cpu"
device = torch.device(dev)


def step(i):
    c = torch.ones(1)
    # comment the line below to see no memory increase
    c = c.to(device)
    time.sleep(sleep_time)


if __name__ == '__main__':
    arr = torch.ones((dim, dim))

    # create list of inputs to be executed in parallel
    inp = list(range(npe))

    # sleep added before and after launching multiprocessing to monitor the memory consumption
    print('before pool')  # to check memory with top or htop
    time.sleep(sleep_time)

    context = get_context('spawn')
    with context.Pool(npe) as pool:
        print('after pool')  # to check memory with top or htop
        time.sleep(sleep_time)

        pool.map(step, inp)
        time.sleep(sleep_time)
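
If watching top/htop by hand is inconvenient, one way to log the resident memory of each worker is a small helper like the sketch below (not part of the original example; it relies on psutil, an extra dependency I'm assuming is available):

import os

import psutil  # assumed extra dependency, not used in the original code


def log_rss(tag):
    # print the resident set size (RSS) of the calling process in MB
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 2**20
    print(f'[pid {os.getpid()}] {tag}: {rss_mb:.0f} MB RSS')

Calling log_rss('in step') at the top of step then shows per-worker RAM usage directly in the output instead of having to watch the process monitor.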
Upvotes: 1
Views: 3350
Reputation: 72349
I’m sending torch.ones(1) to GPU and yet it consumes 487 MB of GPU memory. Does CUDA allocate a minimum amount of memory on the GPU even if the underlying tensor is very small?
The CUDA device runtime reserves memory for all sorts of things when a context is established on the device; some of these reservations are fixed in size, and some are variable and can be controlled by API calls (see here for more information). It is completely normal for the first API call that explicitly or lazily establishes a context on the device to produce a jump in GPU memory consumption. In this case, I imagine the first tensor creation on the device is what triggers this overhead allocation. It is a property of the CUDA runtime, not of PyTorch or of the tensor itself.
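
To see the split between the context overhead and the tensor itself, a minimal sketch along these lines can help (assuming a CUDA-capable machine; the figures in the comments are typical values, not exact):

import torch

if torch.cuda.is_available():
    # The first CUDA operation lazily creates the context on the device.
    c = torch.ones(1, device='cuda:0')
    torch.cuda.synchronize()

    # PyTorch's caching allocator only accounts for the tensor itself ...
    print('allocated:', torch.cuda.memory_allocated())  # typically ~512 bytes (one allocator block)
    print('reserved: ', torch.cuda.memory_reserved())   # typically ~2 MB (one cache segment)
    # ... whereas nvidia-smi reports a few hundred MB for this process,
    # which is the fixed cost of the CUDA context, not the tensor.

The gap between what nvidia-smi shows for the process and what torch.cuda.memory_allocated()/memory_reserved() report is roughly the context-establishment overhead described above.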
Upvotes: 2