Reputation: 3277
I wrote this snippet below to try and understand what's going on with these hooks.
import torch
from torch import nn
import numpy as np

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)
        self.fc1.register_forward_hook(self._forward_hook)
        self.fc1.register_backward_hook(self._backward_hook)

    def forward(self, inp):
        return self.fc2(self.fc1(inp))

    def _forward_hook(self, module, input, output):
        print(type(input))
        print(len(input))
        print(type(output))
        print(input[0].shape)
        print(output.shape)
        print()

    def _backward_hook(self, module, grad_input, grad_output):
        print(type(grad_input))
        print(len(grad_input))
        print(type(grad_output))
        print(len(grad_output))
        print(grad_input[0].shape)
        print(grad_input[1].shape)
        print(grad_output[0].shape)
        print()

model = Model()
out = model(torch.tensor(np.arange(10).reshape(1, 1, 10), dtype=torch.float32))
out.backward()
This produces the output:
<class 'tuple'>
1
<class 'torch.Tensor'>
torch.Size([1, 1, 10])
torch.Size([1, 1, 5])
<class 'tuple'>
2
<class 'tuple'>
1
torch.Size([1, 1, 5])
torch.Size([5])
torch.Size([1, 1, 5])
You can also follow the CNN example here. In fact, it's needed to understand the rest of my question.
I have a few questions:
1. I would normally think that grad_input (backward hook) should be the same shape as output (forward hook), because when we go backwards the direction is reversed. But the CNN example seems to indicate otherwise. I'm still a bit confused. Which way around is it?
2. Why are grad_input[0] and grad_output[0] the same shape on my Linear layer here? Regardless of the answer to question 1, at least one of them should be torch.Size([1, 1, 10]), right?
3. What's with the second element of the tuple grad_input? In the CNN case I copy-pasted the example and did print(grad_input[1].size()), which gave torch.Size([20, 10, 5, 5]), so I presume those are the gradients of the weights. I also ran print(grad_input[2].size()) and got torch.Size([20]), so it seemed clear I was looking at the gradients of the biases. But in my Linear example grad_input has length 2, so I can only access up to grad_input[1], which seems to be giving me the gradients of the biases. So where are the gradients of the weights?
In summary, there are two apparent contradictions between the behaviour of the backward hook in the cases of Conv2d and Linear modules. This has left me totally confused about what to expect with this hook.
Thanks for your help!
Upvotes: 9
Views: 14147
Reputation: 149
I think the problem Piyush Singh explained with Linear layers still persists in 2024. As a workaround, you can register hooks directly on the Linear layer's weight and bias tensors to access (and manipulate) their backward gradients:
import torch
from torch import nn
import numpy as np

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)
        self.fc1.register_forward_hook(self._forward_hook)
        self.fc1.register_backward_hook(self._backward_hook)
        self.fc1.weight.register_hook(self.bkwrd_hook)
        self.fc1.bias.register_hook(self.bkwrd_hook)

    def forward(self, inp):
        return self.fc2(self.fc1(inp))

    def _forward_hook(self, module, input, output):
        print(type(input))
        print(len(input))
        print(type(output))
        print(input[0].shape)
        print(output.shape)
        print()

    def _backward_hook(self, module, grad_input, grad_output):
        print(type(grad_input))
        print(len(grad_input))
        print(type(grad_output))
        print(len(grad_output))
        print(grad_input[0].shape)
        # print(grad_input[1].shape)
        print(grad_output[0].shape)
        print()

    def bkwrd_hook(self, grad):
        print(f'tensor backward hook: {grad}, {grad.shape}')
        return grad

model = Model()
out = model(torch.tensor(np.arange(10).reshape(1, 1, 10), dtype=torch.float32))
out.backward()
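Tensor hooks registered with register_hook receive (and may return a modified version of) the gradient of that tensor. So during out.backward() the hook above fires once with the gradient of fc1.weight (shape (5, 10)) and once with the gradient of fc1.bias (shape (5)), which is exactly what the module-level backward hook does not expose for Linear.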
Upvotes: 0
Reputation: 2972
I would normally think that grad_input (backward hook) should be the same shape as output
grad_input contains the gradient (of whatever tensor backward has been called on; normally it is the loss tensor when doing machine learning, for you it is just the output of the Model) wrt the input of the layer. So it is the same shape as input. Similarly, grad_output is the same shape as the output of the layer. This is also true for the CNN example you have cited.
Why are grad_input[0] and grad_output[0] the same shape on my Linear layer here? Regardless of the answer to my question 1, at least one of them should be torch.Size([1, 1, 10]) right?
Ideally, grad_input should contain the gradients wrt the input of the layer and wrt the weights and biases of the layer. That is the behaviour you see if you use the following backward hook for the CNN example:
def _backward_hook(module, grad_input, grad_output):
    for i, inp in enumerate(grad_input):
        print("Input #", i, inp.shape)
However, this does not happen with the Linear layer. This is because of a bug. The top comment reads:
module hooks are actually registered on the last function that the module has created
So what really might be happening in the backend (my guess) is that it is calculating Y = (W^T X) + b. The adding of the bias is the last operation, so for that operation there is one input of shape (1, 1, 5) and the bias term of shape (5). These two (the gradients wrt them, actually) form your tuple grad_input. The result of the addition (the gradient wrt it, actually) is stored in grad_output, which is of shape (1, 1, 5).
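A rough sketch of that guess (not necessarily the exact ops PyTorch runs internally, but it decomposes fc1 into a matmul followed by a bias add): the two inputs of that final add have exactly the shapes you saw in grad_input.

import torch
from torch import nn

fc1 = nn.Linear(10, 5)
x = torch.randn(1, 1, 10, requires_grad=True)

# fc1(x) split into its last two ops: matmul, then the bias add
matmul_out = x @ fc1.weight.t()   # shape (1, 1, 5)
matmul_out.retain_grad()          # keep the gradient of this intermediate
y = matmul_out + fc1.bias         # the "last function" the module hook attaches to

y.sum().backward()
print(matmul_out.grad.shape)      # torch.Size([1, 1, 5]) -> like grad_input[0]
print(fc1.bias.grad.shape)        # torch.Size([5])       -> like grad_input[1]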
What's with the second element of the tuple grad_input?
As answered above, it is just the gradient wrt whatever "layer parameters" the gradient is being calculated for; normally the weights/biases (whichever apply) of that last operation. In your Linear example, that is the bias of the final addition.
Upvotes: 11