user25004

Reputation: 2048

In PyTorch, quantity.backward() computes the gradient of quantity with respect to which parameters?

The backward method computes the gradient with respect to which parameters? All of the parameters that have requires_grad set to True?

Interestingly, in PyTorch,

  1. computing gradients

and

  2. loading the optimizer that updates parameters based on those gradients

need different information about the identity of the parameters of interest in order to work.

The first seems to know on its own which parameters to compute gradients for. The second needs the parameters to be passed to it explicitly. See the code below.

quantity.backward()                                    # compute gradients
optim = torch.optim.SGD(model.parameters(), lr=0.01)   # optimizer must be told which parameters
optim.step()                                           # update those parameters using their .grad

Why is that?

Why doesn't backward need model.parameters()?

Would it not be more efficient to specify the particular subset of parameters you are interested in?

Upvotes: 0

Views: 114

Answers (1)

KonstantinosKokos

Reputation: 3453

Computing quantity requires constructing a two-sorted graph whose nodes are either tensors or differentiable operations on tensors (the so-called computational graph). Under the hood, PyTorch keeps track of this graph for you. When you call quantity.backward(), you are asking PyTorch to perform a reverse traversal of the graph, from the output back to the inputs, applying the derivative of each operation encountered rather than the operation itself. Leaf tensors that are flagged as requiring gradients accumulate the gradients computed by backward.
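
A minimal sketch of this behaviour (the tensors and values here are arbitrary, purely for illustration):

import torch

# Two leaf tensors: w is flagged as requiring grad, c is not.
w = torch.tensor([2.0, 3.0], requires_grad=True)
c = torch.tensor([1.0, 1.0])        # requires_grad defaults to False

# Building quantity records the computational graph under the hood.
quantity = ((w * c) ** 2).sum()

# Reverse traversal of the graph, from the output back to the leaves.
quantity.backward()

print(w.grad)   # tensor([4., 6.]) -- accumulated on the flagged leaf
print(c.grad)   # None -- c was never flagged as requiring grad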

An optimizer is a different story: it simply implements an optimization strategy over a set of parameters, hence it needs to be told which parameters you want it to optimize. So quantity.backward() computes gradients, while optim.step() uses those gradients to perform an optimization step, updating the parameters contained in model.
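
A sketch of that division of labour (the tiny model and learning rate below are made up for the example):

import torch

model = torch.nn.Linear(4, 1)          # hypothetical tiny model
x = torch.randn(8, 4)

quantity = model(x).pow(2).mean()
quantity.backward()                    # fills p.grad for every parameter that requires grad

optim = torch.optim.SGD(model.parameters(), lr=0.1)   # told explicitly what to optimize
optim.step()                           # reads the .grad fields and updates those parameters
optim.zero_grad()                      # clear accumulated gradients before the next pass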

As for efficiency, I don't see any argument in favor of specifying parameters in the backward pass (what would the semantics of that be?). If what you want is to avoid traversing parts of the graph in backward mode, PyTorch will do that automagically for you, as long as you remember the points below (a minimal sketch follows the list):

  • you can mark leaf tensors as not requiring grad
  • a non-leaf tensor -- the output of some operation f(x1, ..., xN) -- requires grad if at least one of x1, ..., xN requires grad
  • a tensor that doesn't require grad blocks backward traversal, ensuring no unnecessary computation
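
Something like this (frozen and trainable are illustrative names, not anything from the question):

import torch

frozen = torch.nn.Linear(4, 4)
trainable = torch.nn.Linear(4, 1)

# Mark the frozen layer's leaf tensors as not requiring grad.
for p in frozen.parameters():
    p.requires_grad_(False)

x = torch.randn(8, 4)
quantity = trainable(frozen(x)).mean()
quantity.backward()

# Backward never traverses into the frozen subgraph:
print(all(p.grad is None for p in frozen.parameters()))        # True
print(all(p.grad is not None for p in trainable.parameters())) # True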

Upvotes: 2
