user2978125

Reputation: 454

PyTorch gradient of loss (that depends on gradient of network) with respect to parameters

I'm trying to compute the gradient of my loss function with respect to my model parameters in PyTorch.

That is, let u(x; θ) be the model, where x is the input (in R^n) and θ are the model parameters, and let L be the loss; I'm trying to compute dL/dθ.

For a "simple" loss function, this is not a problem, but my loss function depends on the gradient of the model with respect to its inputs (i.e., du/dx). When I attempt to do this, I'm met with the following error message: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

Here is a minimal example to illustrate the issue:

import torch
import torch.nn as nn
from torch.autograd import grad

model = nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))

def loss1(x, u):
    return torch.mean(u)

def loss2(x, u):
    d_u_x = grad(u, x, torch.ones_like(u), retain_graph=True, create_graph=True)[0]
    return torch.mean(d_u_x)

x = torch.randn(10, 1)
x.requires_grad_()
u = model(x)

loss = loss2(x, u)
d_loss_params = grad(loss, model.parameters(), retain_graph=True)

If I change the second-to-last line to read loss = loss1(x, u), things work as expected.

Update: it appears to be working if I set bias=False for the nn.Linears (the exact variant is shown below). OK, that makes some sense, since the bias ends up not being used. But that raises the question: how do I extract only the parameters that are actually used in the gradient computation?
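For concreteness, the bias=False variant mentioned in the update would be (just a sketch of the workaround, not a fix):

model = nn.Sequential(nn.Linear(1, 10, bias=False), nn.Tanh(), nn.Linear(10, 1, bias=False))  # no bias parameters, so nothing is left out of du/dx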

Upvotes: 0

Views: 23

Answers (2)

user2978125

Reputation: 454

This was solved by passing allow_unused=True and materialize_grads=True to grad. That is:

d_loss_params = grad(loss, model.parameters(), retain_graph=True, allow_unused=True, materialize_grads=True)
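For completeness, a full version of the original snippet with this fix applied might look like the following (a sketch; materialize_grads=True replaces the gradients that would otherwise be None with zero tensors):

import torch
import torch.nn as nn
from torch.autograd import grad

model = nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))

x = torch.randn(10, 1, requires_grad=True)
u = model(x)

# the loss depends on du/dx, so build that gradient with create_graph=True
d_u_x = grad(u, x, torch.ones_like(u), create_graph=True)[0]
loss = torch.mean(d_u_x)

# allow_unused=True tolerates parameters that don't appear in the graph
# (here, the bias of the last layer); materialize_grads=True turns their
# None gradients into zeros of the right shape
d_loss_params = grad(loss, model.parameters(), retain_graph=True,
                     allow_unused=True, materialize_grads=True)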

See discussion on https://discuss.pytorch.org/t/gradient-of-loss-that-depends-on-gradient-of-network-with-respect-to-parameters/217275 for more info.

Upvotes: 0

Karl

Reputation: 5473

Short answer:

The bias vector of your final layer plays no part in computing d_loss_params, hence the error. You can get around this by using allow_unused=True. This will result in the final bias vector having None for the gradient.

d_loss_params = grad(loss, model.parameters(), retain_graph=True, allow_unused=True)
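To see which gradient comes back as None, you can pair the returned gradients with the parameter names (a quick sketch; for this nn.Sequential the unused parameter should show up as '2.bias'):

grads = grad(loss, model.parameters(), retain_graph=True, allow_unused=True)
for (name, _), g in zip(model.named_parameters(), grads):
    print(name, None if g is None else tuple(g.shape))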

Long answer:

Take your example network with bias=True for both linear layers:

model = nn.Sequential(nn.Linear(1, 10, bias=True), nn.Tanh(), nn.Linear(10, 1, bias=True))

Say our input is x. Then we have:

x1 = a1 * x + b1  # first linear layer
x2 = tanh(x1)     # tanh activation
x3 = a2 * x2 + b2 # second linear layer

In your loss, you compute the gradient of x3 with respect to x. Evaluate the gradient using the chain rule:

d(x3)/d(x) = (d(x3)/d(x2)) * (d(x2)/d(x1)) * (d(x1)/d(x))

d(x3)/d(x2) = a2
d(x2)/d(x1) = 1 - x2**2
d(x1)/d(x) = a1

d(x3)/d(x) = a2 * (1 - x2**2) * a1

You can also verify this. The result of loss2 will be equal to (model[2].weight * (1 - x2.pow(2)) * model[0].weight.T).sum(1).mean()
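A quick numerical check of that equivalence, reusing x, u, and loss2 from the question (a sketch; x2 here is the activation after the Tanh):

x2 = torch.tanh(model[0](x))
manual = (model[2].weight * (1 - x2.pow(2)) * model[0].weight.T).sum(1).mean()
print(torch.allclose(manual, loss2(x, u)))  # expected: True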

Now substitute x2 in the expression:

# substitute x2 = tanh(x1) = tanh(a1 * x + b1)
d(x3)/d(x) = a2 * (1 - tanh(a1 * x + b1)**2) * a1

From this, we can see that d(x3)/d(x) is computed from a2 (weight of the final linear layer), a1 (weight of first linear layer), b1 (bias of first linear layer) and x (input value). Notably, the bias of the final linear layer b2 does not appear in the expression.

Now we can see what happens when we try to compute d_loss_params = grad(loss, model.parameters(), retain_graph=True, allow_unused=False)

Here loss = (d(x3)/d(x)).mean() = (a2 * (1 - tanh(a1 * x + b1)**2) * a1).mean(). Our model.parameters() list contains a1, b1, a2, and b2. Since b2 does not contribute to our loss value, we cannot compute the gradient of the loss with respect to b2, hence the error.

The error can be avoided by setting allow_unused=True, which will return None values for any parameters not participating in the computation of the output value.
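Alternatively, if you'd rather not deal with None gradients at all, you can pass only the parameters that actually participate, e.g. by filtering out the final bias by name (a sketch; the name '2.bias' is specific to this nn.Sequential):

used_params = [p for name, p in model.named_parameters() if name != '2.bias']
d_loss_params = grad(loss, used_params, retain_graph=True)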

Upvotes: 0
