Reputation: 367
According to the docs, when we call the backward function on a tensor that is non-scalar (i.e. its data has more than one element) and requires gradient, the function additionally requires specifying a gradient argument.
import torch
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
F = a * b
F.backward(gradient=torch.tensor([1.,1.]))
print(a.grad)
Output: tensor([20., 20.])
Now scaling the external gradient:
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
F = a * b
F.backward(gradient=torch.tensor([2.,2.])) #modified
print(a.grad)
Output: tensor([40., 40.])
So, passing the gradient argument to backward seems to scale the gradients.
Also, for a scalar F, F.backward() defaults to F.backward(gradient=torch.tensor(1.)).
Apart from scaling the grad value, how does the gradient parameter passed to the backward function help to compute the derivatives when we have a non-scalar tensor?
Why can't PyTorch calculate the derivative implicitly, without an explicit gradient parameter, as it does for a scalar tensor?
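For reference, if the gradient argument is omitted for this non-scalar F, PyTorch refuses outright; a minimal sketch of that behaviour:

import torch
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
F = a * b
try:
    F.backward()   # no gradient argument for a non-scalar output
except RuntimeError as e:
    print(e)       # typically "grad can be implicitly created only for scalar outputs"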
Upvotes: 8
Views: 1611
Reputation: 1088
If the gradient argument defaulted to torch.ones(F.shape), in this case torch.tensor([1., 1.]), a rookie might accidentally do F.backward() and get a.grad = tensor([20., 20.]). But maybe what this rookie really wants is F.mean().backward(), which is equivalent to F.backward(torch.tensor([.5, .5])) and gives a.grad = tensor([10., 10.]).
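As a minimal sketch (reusing a, b, F from the question) checking that equivalence:

import torch
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
F = a * b
F.mean().backward()   # mean() reduces F to a scalar, so no gradient argument is needed
print(a.grad)         # tensor([10., 10.]), same as F.backward(torch.tensor([.5, .5]))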
This could get nastier if the rookie really wants a BCE loss like torch.nn.functional.binary_cross_entropy(torch.softmax(F, 0), torch.tensor([0., 1.])). This design choice makes it easier to debug.
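A minimal sketch of that scalar-loss pattern, again reusing a, b, F from the question (binary_cross_entropy reduces to a scalar mean by default, so a plain backward() suffices):

import torch
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
F = a * b
loss = torch.nn.functional.binary_cross_entropy(torch.softmax(F, 0), torch.tensor([0., 1.]))
loss.backward()       # loss is a scalar, so no gradient argument is needed
print(a.grad)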
My guesses.
Upvotes: 0
Reputation: 111
The values in the gradient parameter are expected to be the derivative of the final loss w.r.t. the current tensor.
This post explains the idea very well: https://stackoverflow.com/a/47026836/1500108.
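For example, a rough sketch (not from the linked post) where the gradient passed to F.backward is the hand-computed derivative of a downstream scalar loss L = (F ** 2).sum() with respect to F:

import torch

# Reference: let autograd handle the whole chain
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
F = a * b
L = (F ** 2).sum()    # some downstream scalar loss
L.backward()
print(a.grad)         # tensor([8000., 8000.])

# Same thing, but stopping at F and supplying dL/dF = 2*F by hand
a2 = torch.tensor([10.,10.],requires_grad=True)
b2 = torch.tensor([20.,20.],requires_grad=True)
F2 = a2 * b2
F2.backward(gradient=2 * F2.detach())   # gradient = derivative of L w.r.t. F
print(a2.grad)        # tensor([8000., 8000.]), matching the first run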
Upvotes: 0
Reputation: 12867
It's because PyTorch is calculating a vector-Jacobian product. In the case of a scalar value, .backward() without parameters is equivalent to .backward(torch.tensor(1.0)).
That's why you need to provide the tensor with which you want to compute the product. Read more about automatic differentiation.
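As a rough sketch (not part of the original answer), for F = a * b the Jacobian of F with respect to a is diag(b), and backward(gradient=v) produces the product of v with that Jacobian:

import torch
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
F = a * b
v = torch.tensor([2., 2.])
F.backward(gradient=v)        # autograd computes the vector-Jacobian product v @ J
J = torch.diag(b.detach())    # Jacobian of F w.r.t. a, written out by hand: diag(b)
print(a.grad)                 # tensor([40., 40.])
print(v @ J)                  # tensor([40., 40.]) -- same values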
Upvotes: 1