Reputation: 4630
I'm trying to save memory while training a model that uses single precision weights by doing the calculations in half precision.
I tried using autocast, and the model does its predictions in half precision as it should. However, the gradients produced are still in single precision, which defeats both the performance gain and the memory savings. Is there any way to instruct torch to compute the grads in half precision and use them to update the original single-precision weights?
import torch

class KekNet(torch.nn.Module):
    def __init__(self):
        super(KekNet, self).__init__()
        self.layer1 = torch.nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), dtype=torch.float32)

    def forward(self, x, features=False):
        return self.layer1(x)

device = torch.device("cuda")

# HALF-DATA AUTOCAST
net = KekNet().to(device)
loss_l2 = torch.nn.MSELoss(reduction='none')
g_params = [{'params': net.parameters(), 'weight_decay': 0}]
optimizerG = torch.optim.RMSprop(g_params, lr=3e-5, alpha=0.99, eps=1e-07, weight_decay=0)
schedulerG = torch.optim.lr_scheduler.CosineAnnealingLR(optimizerG, T_max=300)

X = torch.randn((40, 3, 555, 555), dtype=torch.float16, device=device)

with torch.autocast(device_type='cuda', dtype=torch.float16):
    Y_h = net(X)
    Y = torch.randn_like(Y_h)
    loss = loss_l2(Y_h, Y).mean()

loss.backward()

print(f"-autocast\r\ndata precision: {X.dtype}\r\npred precision: {Y_h.dtype}\r\ngrad precision: {net.layer1.weight.grad.dtype}\r\n")

optimizerG.step()
schedulerG.step()
This results in the following output:
data precision: torch.float16
pred precision: torch.float16
grad precision: torch.float32
Upvotes: -1
Views: 1055
Reputation: 5363
Autocast doesn't transform the weights of the model, so weight grads will have the same dtype as the weights. You can try manually calling .half()
on the model to change this. I'm not sure if there's a way to compute grads in fp16 while keeping the weights in fp32.
import torch
import torch.nn as nn

torch.set_default_device('cuda')

model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(12, 8, dtype=torch.float16)

# weights are fp32, so the weight grads come out in fp32
# even though the forward pass runs in fp16
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=True):
    output = model(x)
    loss = output.mean()

loss.backward()
print(model.weight.grad.dtype)
# > torch.float32

opt.zero_grad()

# cast the weights themselves to fp16
model.half()

with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=True):
    output = model(x)
    loss = output.mean()

loss.backward()
print(model.weight.grad.dtype)
# > torch.float16
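
If you specifically want fp16 grads while the values the optimizer updates stay in fp32, one thing you could try (just a rough sketch of the usual "master weights" pattern, not something autocast does for you) is to run the forward/backward on an fp16 model and keep separate fp32 master copies of the parameters for the optimizer step:

import torch
import torch.nn as nn

torch.set_default_device('cuda')

# fp16 working model: forward, backward and the grads all stay in fp16
model = nn.Linear(8, 1).half()

# fp32 master copies of the parameters; the optimizer only ever sees these
master_params = [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]
opt = torch.optim.SGD(master_params, lr=1e-3)

x = torch.randn(12, 8, dtype=torch.float16)

loss = model(x).mean()
loss.backward()
print(model.weight.grad.dtype)
# > torch.float16

with torch.no_grad():
    # copy the fp16 grads onto the fp32 masters, step, then copy the
    # updated fp32 values back into the fp16 working weights
    for p, mp in zip(model.parameters(), master_params):
        mp.grad = p.grad.float()
    opt.step()
    for p, mp in zip(model.parameters(), master_params):
        p.copy_(mp)

With pure fp16 grads you'd normally also want some form of loss scaling (that's what torch.cuda.amp.GradScaler is built around) to avoid gradient underflow.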
Additionally, some ops have numerical stability issues when computed in fp16. To avoid this, PyTorch autocasts certain ops to fp32; you can find the full list in the autocast op reference in the PyTorch docs. In your case, MSE loss (and really the pow function) is autocast to fp32. This won't change the weight grad dtype in the example above, but it's worth noting if you see fp32 cropping up in other places.
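As a quick check (assuming a CUDA device is available), both mse_loss and pow return fp32 results under autocast even when all inputs are fp16:

import torch
import torch.nn.functional as F

torch.set_default_device('cuda')

a = torch.randn(4, dtype=torch.float16)
b = torch.randn(4, dtype=torch.float16)

with torch.autocast(device_type='cuda', dtype=torch.float16):
    # both ops are on autocast's "runs in float32" list
    print(F.mse_loss(a, b).dtype)  # torch.float32
    print(a.pow(2).dtype)          # torch.float32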
Upvotes: 2