Reputation: 337
Suppose I have a custom loss function and I want to fit the solution of some differential equation with the help of my neural network. In each forward pass I compute the output of my network and then compute the loss as the MSE between that output and the target function I want my perceptron to fit.
Now my doubt is: should I use `grad(loss)` or `loss.backward()` for backpropagation, to calculate and update my gradients?
I understand that when using `loss.backward()` I have to wrap my tensors with `Variable` and set `requires_grad=True` for the variables w.r.t. which I want to take the gradient of my loss.
So my question is: does `grad(loss)` also require any such explicit parameter to identify the variables for gradient computation? It would also help if you could explain the practical implications of both approaches, because whenever I try to find this online I am bombarded with a lot of material that isn't relevant to my project.
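For context, this is roughly what one training step looks like (a minimal sketch; the network and the target function are placeholders for my actual setup):

```python
import torch

# placeholder network and target function, for illustration only
model = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
f_target = lambda x: torch.sin(x)  # stand-in for the expected solution

x = torch.linspace(0, 1, 100).unsqueeze(1)
loss = torch.nn.functional.mse_loss(model(x), f_target(x))
# ... and here I am unsure whether to call loss.backward()
# or torch.autograd.grad(loss, model.parameters())
```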
Upvotes: 18
Views: 11977
Reputation: 40648
TL;DR: Both are different interfaces for gradient computation: `torch.autograd.grad` is non-mutating while `torch.autograd.backward` is mutating.
The `torch.autograd` module is the automatic differentiation package for PyTorch. As described in the documentation, it requires only minimal changes to an existing code base: you only need to declare the `Tensor`s for which gradients should be computed with the `requires_grad=True` keyword.
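For instance, that declaration is a single line:

```python
import torch

# this leaf tensor opts into gradient tracking
w = torch.tensor([1.0, 2.0], requires_grad=True)
```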
The two main functions `torch.autograd` provides for gradient computation are `torch.autograd.backward` and `torch.autograd.grad`:
| | `torch.autograd.backward` (source) | `torch.autograd.grad` (source) |
|---|---|---|
| **Description** | Computes the sum of gradients of given tensors with respect to graph leaves. | Computes and returns the sum of gradients of outputs with respect to the inputs. |
| **Header** | `torch.autograd.backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None, inputs=None)` | `torch.autograd.grad(outputs, inputs, grad_outputs=None, retain_graph=None, create_graph=False, only_inputs=True, allow_unused=False)` |
| **Parameters** | - `tensors`: Tensors of which the derivative will be computed.<br>- `grad_tensors`: The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of the corresponding tensors.<br>- `retain_graph`: If `False`, the graph used to compute the grad will be freed. [...]<br>- `inputs`: Inputs w.r.t. which the gradient will be accumulated into `.grad`. All other tensors will be ignored. If not provided, the gradient is accumulated into all the leaf tensors that were used [...]. | - `outputs`: Outputs of the differentiated function.<br>- `inputs`: Inputs w.r.t. which the gradient will be returned (and not accumulated into `.grad`).<br>- `grad_outputs`: The "vector" in the Jacobian-vector product, usually gradients w.r.t. each element of the corresponding tensors.<br>- `retain_graph`: If `False`, the graph used to compute the grad will be freed. [...] |
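To make the difference between the two interfaces concrete, here is a minimal sketch with a toy scalar loss (my own example, not from the docs):

```python
import torch

w = torch.rand(3, requires_grad=True)
loss = (w ** 2).sum()

# grad(): returns the gradients, w.grad stays untouched
(dw,) = torch.autograd.grad(loss, w, retain_graph=True)
print(w.grad)  # None: nothing was accumulated

# backward(): returns nothing, accumulates into w.grad
torch.autograd.backward(loss)  # same effect as loss.backward()
assert torch.equal(dw, w.grad)
```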
In terms of high-level usage, you can look at `torch.autograd.grad` as a non-mutating function. As mentioned in the table above, it will not accumulate the gradients into the `grad` attribute but will instead return the computed partial derivatives. In contrast, `torch.autograd.backward` mutates the leaf tensors by updating their `grad` attribute, and the function returns nothing. In other words, the latter is more suitable when computing gradients for a large number of parameters.
In the following, we will take two inputs (`x1` and `x2`), calculate a tensor `y` with them, and then compute the partial derivatives of the result w.r.t. both inputs, i.e. `dL/dx1` and `dL/dx2`:
```python
>>> x1 = torch.rand(1, requires_grad=True)
>>> x2 = torch.rand(1, requires_grad=True)
>>> x1, x2
(tensor([0.3939], requires_grad=True), tensor([0.7965], requires_grad=True))
```
Inference:
```python
>>> y = x1**2 + 5*x2
>>> y
tensor([4.1377], grad_fn=<AddBackward0>)
```
Since `y` was computed from tensor(s) requiring gradients (i.e. with `requires_grad=True`), and outside of a `torch.no_grad` context, it will have a `grad_fn` attached. This callback is used to backpropagate through the computation graph and compute the gradients of the preceding tensor nodes.
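For comparison, a quick sketch of what `torch.no_grad` changes (this is not part of the original example):

```python
>>> with torch.no_grad():
...     z = x1**2 + 5*x2
>>> z.grad_fn is None  # no graph was recorded, z cannot be backpropagated through
True
```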
`torch.autograd.grad`: here we provide `torch.ones_like(y)` as the `grad_outputs`:
```python
>>> torch.autograd.grad(y, (x1, x2), torch.ones_like(y))
(tensor([0.7879]), tensor([5.]))
```
The above output is a tuple containing the two partial derivatives w.r.t. the provided inputs, in order of appearance, i.e. `dL/dx1` and `dL/dx2`.
This corresponds to the following computation:

```python
# dL/dx1 = dL/dy * dy/dx1 = grad_outputs * 2*x1
# dL/dx2 = dL/dy * dy/dx2 = grad_outputs * 5
```
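You can check this by hand, reusing the tensors from above:

```python
>>> torch.ones_like(y) * 2 * x1  # dL/dx1
tensor([0.7879], grad_fn=<MulBackward0>)
>>> torch.ones_like(y) * 5       # dL/dx2
tensor([5.])
```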
`torch.autograd.backward`: in contrast, it mutates the tensors involved: it updates the `grad` attribute of the gradient-requiring tensors that were used to compute the output tensor. It is equivalent to the `torch.Tensor.backward` API. Here we go through the same example, defining `x1`, `x2`, and `y` again, and call `backward`:
```python
>>> # equivalent to: y.backward(torch.ones_like(y))
>>> torch.autograd.backward(y, torch.ones_like(y))  # returns None
```
Then you can retrieve the gradients on `x1.grad` and `x2.grad`:
```python
>>> x1.grad, x2.grad
(tensor([0.7879]), tensor([5.]))
```
In conclusion: both perform the same operation. They are two different interfaces to interact with the `autograd` library and perform gradient computations. The latter, `torch.autograd.backward` (equivalent to `torch.Tensor.backward`), is generally used in neural-network training loops to compute the partial derivative of the loss w.r.t. each of the model's parameters.
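Schematically, such a training step might look like this (a minimal sketch; the model, optimizer, and data are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, target = torch.rand(8, 10), torch.rand(8, 1)

opt.zero_grad()  # clear gradients accumulated by previous steps
loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()  # accumulate dloss/dparam into each param.grad
opt.step()       # update the parameters from their .grad attributes
```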
You can read more about how `torch.autograd.grad` works in this other answer of mine: Meaning of grad_outputs in PyTorch's torch.autograd.grad.
Upvotes: 24
Reputation: 10865
In addition to Ivan's answer: because `torch.autograd.grad` does not accumulate gradients into `.grad`, it can avoid race conditions in multi-threaded scenarios.
Quoting the PyTorch docs (https://pytorch.org/docs/stable/notes/autograd.html#non-determinism):
> If you are calling backward() on multiple thread concurrently but with shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic on backward calls across threads, because two backward calls might access and try to accumulate the same .grad attribute. This is technically not safe, and it might result in racing condition and the result might be invalid to use.
>
> But this is expected pattern if you are using the multithreading approach to drive the whole training process but using shared parameters, user who use multithreading should have the threading model in mind and should expect this to happen. User could use the functional API torch.autograd.grad() to calculate the gradients instead of backward() to avoid non-determinism.
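For illustration, a minimal sketch of that functional pattern (the two-thread setup is my own, not from the docs):

```python
import threading
import torch

x = torch.ones(3, requires_grad=True)
results = {}

def worker(i):
    y = ((i + 1) * x).sum()
    # grad() returns the gradient instead of accumulating into x.grad,
    # so concurrent calls do not race on the shared .grad attribute
    results[i] = torch.autograd.grad(y, x)[0]

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # {0: tensor([1., 1., 1.]), 1: tensor([2., 2., 2.])}
```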
Implementation details: https://github.com/pytorch/pytorch/blob/7e3a694b23b383e38f5e39ef960ba8f374d22404/torch/csrc/autograd/functions/accumulate_grad.h
Upvotes: 3