Reputation: 454
I'm trying to compute the gradient of my loss function with respect to my model parameters in PyTorch.
That is, let u(x; θ) be the model, where x is the input (in R^n) and θ are the model parameters. I'm trying to compute du/dθ.
For a "simple" loss function this is not a problem, but my loss function depends on the gradient of the model with respect to its inputs (i.e., du/dx). When I attempt to do this, I'm met with the following error message: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
Here is a minimal example to illustrate the issue:
import torch
import torch.nn as nn
from torch.autograd import grad
model = nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))
def loss1(x, u):
    return torch.mean(u)
def loss2(x, u):
    # d_u_x is du/dx, kept in the graph (create_graph=True) so the loss can be
    # differentiated again with respect to the parameters
    d_u_x = grad(u, x, torch.ones_like(u), retain_graph=True, create_graph=True)[0]
    return torch.mean(d_u_x)
x = torch.randn(10, 1)
x.requires_grad_()
u = model(x)
loss = loss2(x, u)
d_loss_params = grad(loss, model.parameters(), retain_graph=True)
If I change the second to last line to read loss = loss1(x, u), things work as expected.
Update: it appears to work if I set bias=False for the nn.Linear layers. OK, that makes some sense, since the final layer's bias never enters du/dx. But that raises the question: how do I extract only the parameters that actually contribute, to use in the gradient computation?
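For concreteness, I was picturing something along these lines (just a sketch of what I mean; the requires_grad filter is my guess and doesn't actually exclude the biases, since they have requires_grad=True as well):
# A guess at the kind of filtering I mean. This particular filter doesn't help,
# though, since the biases also have requires_grad=True.
params_to_diff = [p for p in model.parameters() if p.requires_grad]
d_loss_params = grad(loss, params_to_diff, retain_graph=True)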
Upvotes: 0
Views: 23
Reputation: 454
This was solved by passing allow_unused=True and materialize_grads=True to grad. That is:
d_loss_params = grad(loss, model.parameters(), retain_graph=True, allow_unused=True, materialize_grads=True)
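As a quick sanity check (continuing the snippet from the question), the gradient for the final layer's bias, which is the last entry returned, now comes back as a zero tensor instead of None:
# The last entry of d_loss_params corresponds to model[2].bias. With
# materialize_grads=True it is an all-zero tensor rather than None.
print(d_loss_params[-1])  # tensor([0.])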
See discussion on https://discuss.pytorch.org/t/gradient-of-loss-that-depends-on-gradient-of-network-with-respect-to-parameters/217275 for more info.
Upvotes: 0
Reputation: 5473
Short answer:
The bias vector of your final layer plays no part in computing the loss, hence the error when computing d_loss_params. You can get around this by using allow_unused=True, which will result in the final bias vector having None for its gradient.
d_loss_params = grad(loss, model.parameters(), retain_graph=True, allow_unused=True)
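If you want to confirm which parameter ends up with the None gradient, one option (a sketch, reusing the model, loss, and grad call from the question) is to pair the result with model.named_parameters():
d_loss_params = grad(loss, model.parameters(), retain_graph=True, allow_unused=True)
for (name, _), g in zip(model.named_parameters(), d_loss_params):
    # only '2.bias' (the final layer's bias) should print as None here
    print(name, None if g is None else tuple(g.shape))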
Long answer:
Take your example network with bias=True for both linear layers:
model = nn.Sequential(nn.Linear(1, 10, bias=True), nn.Tanh(), nn.Linear(10, 1, bias=True))
Say our input is x. Then we have:
x1 = a1 * x + b1 # first linear layer
x2 = tanh(x1) # tanh activation
x3 = a2 * x2 + b2 # second linear layer
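In terms of the nn.Sequential above, these symbols map onto the module attributes as follows (just notation, no new computation):
a1, b1 = model[0].weight, model[0].bias  # first linear layer, shapes (10, 1) and (10,)
a2, b2 = model[2].weight, model[2].bias  # second linear layer, shapes (1, 10) and (1,)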
In your loss, you compute the gradient of x3 with respect to x. Evaluate the gradient using the chain rule:
d(x3)/d(x) = (d(x3)/d(x2)) * (d(x2)/d(x1)) * (d(x1)/d(x))
d(x3)/d(x2) = a2
d(x2)/d(x1) = 1 - x2**2
d(x1)/d(x) = a1
d(x3)/d(x) = a2 * (1 - x2**2) * a1
You can also verify this: the result of loss2 will be equal to (model[2].weight * (1 - x2.pow(2)) * model[0].weight.T).sum(1).mean()
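A quick numerical check of that claim (a sketch; it reuses the model defined just above, with both biases enabled, and the loss2 function from the question):
x = torch.randn(10, 1, requires_grad=True)
u = model(x)                   # u corresponds to x3 in the notation above
x2 = torch.tanh(model[0](x))   # output of the tanh activation
manual = (model[2].weight * (1 - x2.pow(2)) * model[0].weight.T).sum(1).mean()
print(torch.allclose(loss2(x, u), manual))  # expect: True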
Now substitute x2 in the expression:
# substitute x2 = tanh(x1) = tanh(a1 * x + b1)
d(x3)/d(x) = a2 * (1 - tanh(a1 * x + b1)**2) * a1
From this, we can see that d(x3)/d(x) is computed from a2 (the weight of the final linear layer), a1 (the weight of the first linear layer), b1 (the bias of the first linear layer), and x (the input value). Notably, the bias of the final linear layer, b2, does not appear in the expression.
Now we can see what happens when we try to compute
d_loss_params = grad(loss, model.parameters(), retain_graph=True, allow_unused=False)
Here loss = (d(x3)/d(x)).mean() = (a2 * (1 - tanh(a1 * x + b1)**2) * a1).mean(). Our model.parameters() list contains a1, b1, a2, and b2. Since b2 does not contribute to our loss value, we cannot compute the gradient of the loss with respect to b2, hence the error.
The error can be avoided by setting allow_unused=True, which will return None values for any parameters not participating in the computation of the output value.
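And for your follow-up question about passing only the parameters that are actually used: one option (a sketch specific to this nn.Sequential layout, where the unused final bias is named '2.bias') is to filter the parameters by name before calling grad, so that allow_unused is no longer needed:
# Exclude the unused final bias from the inputs passed to grad.
used_params = [p for name, p in model.named_parameters() if name != '2.bias']
d_loss_params = grad(loss, used_params, retain_graph=True)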
Upvotes: 0