Reputation: 11
I am building a Bayesian neural network, and I need to manually calculate the gradient of each neural network output and update the network parameters.
For example, in the following network, how can I get the gradients of the neural network outputs ag and bg with respect to the network parameters phi, i.e. ∂ag/∂phi and ∂bg/∂phi, and update the parameters accordingly?
import torch
import torch.nn as nn

class encoder(torch.nn.Module):
    def __init__(self, _l_dim, _hidden_dim, _fg_dim):
        super(encoder, self).__init__()
        self.hidden_nn = nn.Linear(_l_dim, _hidden_dim)
        self.ag_nn = nn.Linear(_hidden_dim, _fg_dim)
        self.bg_nn = nn.Linear(_hidden_dim, _fg_dim)

    def forward(self, _lg):
        ag = self.ag_nn(self.hidden_nn(_lg))
        bg = self.bg_nn(self.hidden_nn(_lg))
        return ag, bg
Upvotes: 1
Views: 1641
Reputation: 40768
You are looking to compute the gradients of the parameters corresponding to each loss term. Given a model f parametrized by θ_ag and θ_bg (these two parameter sets might overlap: that's the case here, since you have a shared hidden layer), f(x; θ_ag, θ_bg) outputs a pair of elements ag and bg. Your loss function is defined as L = L_ag + L_bg.

The terms you want to compute are dL_ag/dθ_ag and dL_bg/dθ_bg, which is different from what you would get with a single backward call on L, namely dL/dθ_ag and dL/dθ_bg.
In order to compute those terms, you will need two backward passes, and after each one we will collect the respective term. Before starting, here are a couple of things you need to do:

It will be useful to make θ_ag and θ_bg accessible. You can, for example, add these two methods to your model definition:
def ag_params(self):
    return [*self.hidden_nn.parameters(), *self.ag_nn.parameters()]

def bg_params(self):
    return [*self.hidden_nn.parameters(), *self.bg_nn.parameters()]
Assume you have a loss function loss_fn which outputs two scalar values, L_ag and L_bg. Here is a mockup of loss_fn:

def loss_fn(ag, bg):
    return ag.mean(), bg.mean()
We will also need an optimizer to zero out the gradients, here SGD:
optim = torch.optim.SGD(model.parameters(), lr=1e-3)
Then we can start applying the following method:
Run an inference to compute ag and bg, as well as L_ag and L_bg:

>>> ag, bg = model(x)
>>> L_ag, L_bg = loss_fn(ag, bg)
Backpropagate once on L_ag, while retaining the graph:

>>> L_ag.backward(retain_graph=True)
At this point, we can collect dL_ag/dθ_ag on the parameters contained in θ_ag. For example, you could take the norm of the different parameter gradients using the ag_params method:

>>> pgrad_ag = torch.stack([p.grad.norm()
...     for p in model.ag_params() if p.grad is not None])
Next, we can proceed with a second backpropagation, this time on L_bg. But before that, we need to clear the gradients so dL_ag/dθ_ag doesn't pollute the next computation:

>>> optim.zero_grad()
Backpropagate on L_bg (no need to retain the graph this time, since no further backward pass follows):

>>> L_bg.backward()
Here again, we collect the gradient norms, i.e. the norms of dL_bg/dθ_bg, this time using the bg_params method:

>>> pgrad_bg = torch.stack([p.grad.norm()
...     for p in model.bg_params() if p.grad is not None])
You now have pgrad_ag and pgrad_bg, which correspond to the gradient norms of dL_ag/dθ_ag and dL_bg/dθ_bg, respectively.
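Putting the steps together, here is a self-contained sketch (the dimensions, the mean-based loss, and the class name Encoder are illustrative choices, not prescribed by the question):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, l_dim, hidden_dim, fg_dim):
        super().__init__()
        self.hidden_nn = nn.Linear(l_dim, hidden_dim)
        self.ag_nn = nn.Linear(hidden_dim, fg_dim)
        self.bg_nn = nn.Linear(hidden_dim, fg_dim)

    def forward(self, lg):
        h = self.hidden_nn(lg)
        return self.ag_nn(h), self.bg_nn(h)

    def ag_params(self):
        # shared hidden layer + ag head
        return [*self.hidden_nn.parameters(), *self.ag_nn.parameters()]

    def bg_params(self):
        # shared hidden layer + bg head
        return [*self.hidden_nn.parameters(), *self.bg_nn.parameters()]

def loss_fn(ag, bg):
    return ag.mean(), bg.mean()

model = Encoder(4, 8, 2)
optim = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(16, 4)

# 1. inference
ag, bg = model(x)
L_ag, L_bg = loss_fn(ag, bg)

# 2. first backward, keeping the graph alive for the second pass
L_ag.backward(retain_graph=True)
pgrad_ag = torch.stack([p.grad.norm()
                        for p in model.ag_params() if p.grad is not None])

# 3. clear gradients so dL_ag/dθ_ag doesn't leak into the next pass
optim.zero_grad()

# 4. second backward and collection
L_bg.backward()
pgrad_bg = torch.stack([p.grad.norm()
                        for p in model.bg_params() if p.grad is not None])

# the parameters can then be updated from the current .grad values,
# e.g. with optim.step(), or manually per parameter group
```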
Upvotes: 0
Reputation: 5289
If you want to compute dx/dW, you can use autograd for that: torch.autograd.grad(x, W, grad_outputs=torch.ones_like(x), retain_graph=True). Does that actually accomplish what you're trying to do?
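As a sketch of this approach (the layer sizes are illustrative), torch.autograd.grad returns the gradient directly without touching the parameters' .grad attributes. Note that with grad_outputs of ones this is a vector-Jacobian product, i.e. the gradient of the sum of the output's entries, not the full Jacobian:

```python
import torch
import torch.nn as nn

hidden = nn.Linear(4, 8)
ag_head = nn.Linear(8, 2)

x = torch.randn(16, 4)
ag = ag_head(hidden(x))

W = ag_head.weight
# gradient of ag.sum() with respect to W, leaving W.grad untouched
(dag_dW,) = torch.autograd.grad(ag, W,
                                grad_outputs=torch.ones_like(ag),
                                retain_graph=True)
print(dag_dW.shape)  # same shape as W: torch.Size([2, 8])
```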
Upvotes: 0