I am training a VGG16 model from scratch on an AWS EC2 Deep Learning AMI machine (Ubuntu 18.04.3 LTS (GNU/Linux 4.15.0-1054-aws x86_64v)) with Python 3 (CUDA 10.1 and Intel MKL, PyTorch 1.3.1), and I am hitting the error below while updating model parameters.
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 11.17 GiB total capacity; 10.76 GiB already allocated; 4.81 MiB free; 119.92 MiB cached)
Code for updating parameters:
import torch
from torch import autograd
from torch.nn import functional as F
from torch.utils.data import DataLoader

def _update_fisher_params(self, current_ds, batch_size, num_batch):
    dl = DataLoader(current_ds, batch_size, shuffle=True)
    log_liklihoods = []
    for i, (input, target) in enumerate(dl):
        if i > num_batch:
            break
        output = F.log_softmax(self.model(input.cuda().float()), dim=1)
        # collect the log-likelihood of the target class for each batch
        log_liklihoods.append(output[:, target])
    log_likelihood = torch.cat(log_liklihoods).mean()
    grad_log_liklihood = autograd.grad(log_likelihood, self.model.parameters())
    _buff_param_names = [param[0].replace('.', '__') for param in self.model.named_parameters()]
    for _buff_param_name, param in zip(_buff_param_names, grad_log_liklihood):
        # estimated Fisher diagonal: squared gradient of the log-likelihood
        self.model.register_buffer(_buff_param_name + '_estimated_fisher', param.data.clone() ** 2)
After debugging: the line log_liklihoods.append(output[:, target]) throws the error after 157 iterations.
I have the required memory, but it does not get allocated. I don't understand why updating the gradients causes the memory problem, since gradients should be de-referenced and released automatically on each iteration. Any idea?
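As a rough check (a minimal sketch reusing the loop above, not part of the original code), printing torch.cuda.memory_allocated() on every iteration shows whether the allocated GPU memory climbs batch after batch:

for i, (input, target) in enumerate(dl):
    if i > num_batch:
        break
    output = F.log_softmax(self.model(input.cuda().float()), dim=1)
    log_liklihoods.append(output[:, target])
    # bytes currently held by tensors on the GPU, reported in MiB
    print(i, torch.cuda.memory_allocated() // 2**20, "MiB allocated")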
I have tried the following solutions, but no luck.
Machine Specs:
Upvotes: 3
Views: 666
Finally, I solved the memory problem! I realized that in each iteration I put the input data into a new tensor, and PyTorch builds a new computation graph. That causes the used GPU memory to grow forever. Then I used the .detach() function, and the memory always stays at a low level.
self.model(input.cuda().float()).detach().requires_grad_(True)
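To see the effect in isolation (a minimal standalone sketch with a hypothetical stand-in model, not the original VGG16 training code): appending un-detached outputs keeps every batch's computation graph and its activations alive, so allocated GPU memory keeps climbing, while appending detached copies lets each graph be freed right away.

import torch

# hypothetical small model and batch size, chosen only to make the effect visible
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 10),
).cuda()
x = torch.randn(256, 1024).cuda()

kept = []
for i in range(10):
    out = model(x)
    kept.append(out)              # keeps the whole graph (activations) alive -> memory climbs
    # kept.append(out.detach())   # drops the graph -> memory stays roughly flat
    print(i, torch.cuda.memory_allocated() // 2**20, "MiB allocated")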
Upvotes: 2