Abishek Bashyal

Reputation: 367

Why do we need to pass the gradient parameter to the backward function in PyTorch?

According to the docs, when we call backward on a tensor that is non-scalar (i.e. its data has more than one element) and requires gradient, the function additionally requires specifying a gradient argument.

import torch
a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)

F = a * b
F.backward(gradient=torch.tensor([1.,1.])) 

print(a.grad)

Output: tensor([20., 20.])

Now scaling the external gradient:

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)

F = a * b
F.backward(gradient=torch.tensor([2.,2.])) #modified

print(a.grad)

Output: tensor([40., 40.])

So, passing the gradient argument to backward seems to scale the gradients.
Also, for a scalar tensor, F.backward() is by default equivalent to F.backward(gradient=torch.tensor(1.)).
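Trying a non-uniform gradient (values picked arbitrarily for illustration) suggests that each entry weights the matching element of F separately rather than applying one global scale:

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)

F = a * b
F.backward(gradient=torch.tensor([1.,3.])) #non-uniform weights

print(a.grad)

Output: tensor([20., 60.])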

Apart from scaling the grad value, how does the gradient parameter passed to the backward function help to compute the derivatives when we have a non-scalar tensor?
Why can't PyTorch calculate the derivative implicitly, without asking for an explicit gradient parameter, as it does for a scalar tensor?

Upvotes: 8

Views: 1611

Answers (3)

Leo

Reputation: 1088

If the gradient argument defaulted to torch.ones(F.shape), in this case torch.tensor([1.,1.]), a rookie might accidentally do F.backward() and get a.grad = tensor([20., 20.]), when what the rookie really wants is F.mean().backward(), which is equivalent to F.backward(torch.tensor([.5, .5])) and gives a.grad = tensor([10., 10.]).
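A quick sketch checking that equivalence, reusing a and b from the question:

import torch

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
F = a * b
F.mean().backward()
print(a.grad)  # tensor([10., 10.])

a = torch.tensor([10.,10.],requires_grad=True)
b = torch.tensor([20.,20.],requires_grad=True)
F = a * b
F.backward(torch.tensor([.5, .5]))
print(a.grad)  # tensor([10., 10.]) as well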

This could get even nastier if what the rookie really wants is a BCE loss like torch.nn.functional.binary_cross_entropy(torch.softmax(F, 0), torch.tensor([0., 1.])). Requiring an explicit gradient argument makes such mistakes easier to debug.

My guesses.

Upvotes: 0

gaolei

Reputation: 111

The values in the gradient parameter are expected to be the derivatives of the final loss w.r.t. the current tensor.

This post explains the idea very well: https://stackoverflow.com/a/47026836/1500108.
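For example (my own made-up loss, reusing a and b from the question): if the final loss were L = (F ** 2).sum(), then dL/dF = 2 * F, and passing that as the gradient reproduces what L.backward() would have produced:

import torch

a = torch.tensor([10., 10.], requires_grad=True)
b = torch.tensor([20., 20.], requires_grad=True)
F = a * b
L = (F ** 2).sum()            # the "final loss"
L.backward()
print(a.grad)                 # tensor([8000., 8000.])

a = torch.tensor([10., 10.], requires_grad=True)
b = torch.tensor([20., 20.], requires_grad=True)
F = a * b
F.backward(gradient=2 * F.detach())   # dL/dF for L = (F ** 2).sum()
print(a.grad)                 # tensor([8000., 8000.])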

Upvotes: 0

Harshit Kumar

Reputation: 12867

It's because PyTorch computes a vector-Jacobian product. In the case of a scalar value, .backward() without parameters is equivalent to .backward(torch.tensor(1.0)).

That's why you need to provide the tensor with which you want to calculate the product. Read more about automatic differentiation.
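A small sketch that makes this product explicit, using torch.autograd.functional.jacobian with the values from the question: backward(gradient=v) computes the vector-Jacobian product v @ J, which is why v must have the same shape as F.

import torch
from torch.autograd.functional import jacobian

a = torch.tensor([10., 10.], requires_grad=True)
b = torch.tensor([20., 20.])

v = torch.tensor([2., 2.])           # the "gradient" argument
F = a * b
F.backward(gradient=v)
print(a.grad)                        # tensor([40., 40.])

J = jacobian(lambda x: x * b, a)     # Jacobian of F w.r.t. a, here diag(b)
print(v @ J)                         # tensor([40., 40.]) -- same result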

Upvotes: 1
