Freezing weights in PyTorch for the param_groups setting
So if one wants to freeze weights during training:
# child is a sub-module of the model, e.g. obtained from model.children()
for param in child.parameters():
    param.requires_grad = False
the optimizer also has to be updated to not include the weights that no longer require gradients:
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=opt.lr, amsgrad=True)
If one wants to use different weight_decay / learning rates for biases and weights (this also allows for differing learning rates), param_groups, a list of dicts, is defined:
param_groups = [{'params': model.module.bias_parameters(), 'weight_decay': args.bias_decay},
                {'params': model.module.weight_parameters(), 'weight_decay': args.weight_decay}]
and passed into the optimizer as follows:
optimizer = torch.optim.Adam(param_groups, args.lr,
                             betas=(args.momentum, args.beta))
How can this be achieved while also freezing individual weights? By running filter over a list of dicts, or is there a way of adding tensors to the optimizer separately?
Upvotes: 13
Views: 16125
Actually, I don't think you have to update the optimizer.
The Parameters handed over to the optimizer are just references, so when you change the requires_grad flag the optimizer sees the change immediately.
But even if that were not the case for some reason: as soon as you set the requires_grad flag to False, no new gradients can be calculated for this weight (see the part about None and zero gradients at the bottom), so the gradient won't change anymore, and if you use optimizer.zero_grad() it will just stay zero.
So if there is no gradient, there is also no need to exclude these weights from the optimizer, because without a gradient the optimizer will just do nothing, no matter what learning rate you use.
Here is a small example to show this behaviour:
import torch
import torch.nn as nn
import torch.optim as optim
n_dim = 5
p1 = nn.Linear(n_dim, 1)
p2 = nn.Linear(n_dim, 1)
# both parameter sets are handed to the optimizer, p2 is frozen from the start
optimizer = optim.Adam(list(p1.parameters()) + list(p2.parameters()))
p2.weight.requires_grad = False

for i in range(4):
    dummy_loss = (p1(torch.rand(n_dim)) + p2(torch.rand(n_dim))).squeeze()
    optimizer.zero_grad()
    dummy_loss.backward()
    optimizer.step()
    print('p1: requires_grad =', p1.weight.requires_grad, ', gradient:', p1.weight.grad)
    print('p2: requires_grad =', p2.weight.requires_grad, ', gradient:', p2.weight.grad)
    print()
    if i == 1:
        # swap: freeze p1 and unfreeze p2 after the second iteration
        p1.weight.requires_grad = False
        p2.weight.requires_grad = True
Output:
p1: requires_grad = True , gradient: tensor([[0.8522, 0.0020, 0.1092, 0.8167, 0.2144]])
p2: requires_grad = False , gradient: None

p1: requires_grad = True , gradient: tensor([[0.7635, 0.0652, 0.0902, 0.8549, 0.6273]])
p2: requires_grad = False , gradient: None

p1: requires_grad = False , gradient: tensor([[0., 0., 0., 0., 0.]])
p2: requires_grad = True , gradient: tensor([[0.1343, 0.1323, 0.9590, 0.9937, 0.2270]])

p1: requires_grad = False , gradient: tensor([[0., 0., 0., 0., 0.]])
p2: requires_grad = True , gradient: tensor([[0.0100, 0.0123, 0.8054, 0.9976, 0.6397]])
Here you can see that no gradients are calculated for the frozen parameters. You may have noticed that the gradient for p2 is None at the beginning, whereas for p1 it is tensor([[0., 0., 0., 0., 0.]]) instead of None after its gradients are deactivated.
This is the case because p1.weight.grad is just a variable which is modified by backward() and optimizer.zero_grad().
So at the beginning p1.weight.grad is just initialized with None; after gradients have been written or accumulated into this variable, they won't be cleared automatically. But because optimizer.zero_grad() is called, they are set to zero and stay like this, since backward() cannot calculate any new gradients with requires_grad=False.
You can also change the code in the if-statement to:
if i == 1:
    p1.weight.requires_grad = False
    p1.weight.grad = None
    p2.weight.requires_grad = True
So once reset to None, the gradients are left untouched and stay None:
p1: requires_grad = True , gradient: tensor([[0.2375, 0.7528, 0.1501, 0.3516, 0.3470]])
p2: requires_grad = False , gradient: None

p1: requires_grad = True , gradient: tensor([[0.5181, 0.5178, 0.6590, 0.6950, 0.2743]])
p2: requires_grad = False , gradient: None

p1: requires_grad = False , gradient: None
p2: requires_grad = True , gradient: tensor([[0.4797, 0.7203, 0.2284, 0.9045, 0.6671]])

p1: requires_grad = False , gradient: None
p2: requires_grad = True , gradient: tensor([[0.8344, 0.1245, 0.0295, 0.2968, 0.8816]])
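If you nevertheless want to drop the frozen parameters from the optimizer while keeping the param_groups setup from your question, one option is to filter each dict before passing the list in. This is only a minimal sketch, assuming the bias_parameters() / weight_parameters() helpers and the args values from your question:
# hypothetical sketch: keep only parameters that still require gradients in each group
param_groups = [{'params': model.module.bias_parameters(), 'weight_decay': args.bias_decay},
                {'params': model.module.weight_parameters(), 'weight_decay': args.weight_decay}]
param_groups = [{**group, 'params': [p for p in group['params'] if p.requires_grad]}
                for group in param_groups]
optimizer = torch.optim.Adam(param_groups, args.lr,
                             betas=(args.momentum, args.beta))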
I hope this makes sense to you!
Upvotes: 15