user118967

Reputation: 5722

Meaning of grad_outputs in PyTorch's torch.autograd.grad

I am having trouble understanding the conceptual meaning of the grad_outputs option in torch.autograd.grad.

The documentation says:

grad_outputs should be a sequence of length matching output containing the “vector” in Jacobian-vector product, usually the pre-computed gradients w.r.t. each of the outputs. If an output doesn’t require_grad, then the gradient can be None.

I find this description quite cryptic. What exactly do they mean by Jacobian-vector product? I know what the Jacobian is, but I am not sure what product they mean here: element-wise, matrix product, something else? I can't tell from my example below.

And why is "vector" in quotes? Indeed, in the example below I get an error when grad_outputs is a vector, but not when it is a matrix.

>>> x = torch.tensor([1.,2.,3.,4.], requires_grad=True)
>>> y = torch.outer(x, x)

Why do we observe the following output; how was it computed?

>>> y
tensor([[ 1.,  2.,  3.,  4.],
        [ 2.,  4.,  6.,  8.],
        [ 3.,  6.,  9., 12.],
        [ 4.,  8., 12., 16.]], grad_fn=<MulBackward0>)

>>> torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y))
(tensor([20., 20., 20., 20.]),)

However, why this error?

>>> torch.autograd.grad(y, x, grad_outputs=torch.ones_like(x))  

RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4]) and output[0] has a shape of torch.Size([4, 4]).

Upvotes: 11

Views: 8330

Answers (1)

Ivan

Reputation: 40668

If we take your example, we have a function f which takes as input x shaped (n,) and outputs y = f(x) shaped (n, n). The input is the column vector [x_i]_i for i ∈ [1, n], and f(x) is defined as the matrix [y_jk]_jk = [x_j*x_k]_jk for j, k ∈ [1, n]².
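
To make that definition concrete, here is a small sketch (my own, using the same x as in your example) that spells out y[j, k] = x[j] * x[k] with explicit loops and checks it against torch.outer:

import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
n = x.shape[0]

# Spell out y[j, k] = x[j] * x[k] element by element
# (detached copy, only used as a reference value)
xd = x.detach()
y_manual = torch.empty(n, n)
for j in range(n):
    for k in range(n):
        y_manual[j, k] = xd[j] * xd[k]

y = torch.outer(x, x)
print(torch.allclose(y, y_manual))  # True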

It is often useful to compute the gradient of the output with respect to the input (or sometimes w.r.t. the parameters of f; there are none here). In the more general case, though, we are looking to compute dL/dx and not just dy/dx, where L is some scalar quantity computed from y and dL/dx is its partial derivative w.r.t. x.

The computation graph looks like:

x.grad = dL/dx <-------  dL/dy = y.grad
                dy/dx
       x       ------->  y = x*xT

Now, dL/dx is, via the chain rule, equal to dL/dy*dy/dx. Looking at the interface of torch.autograd.grad, we have the following correspondences:

  • outputs <-> y,
  • inputs <-> x, and
  • grad_outputs <-> dL/dy.

Looking at the shapes: dL/dx should have the same shape as x (dL/dx can be referred to as the 'gradient' of x), while dy/dx, the Jacobian, is 3-dimensional here, shaped (n, n, n). On the other hand, dL/dy, the incoming gradient, should have the same shape as the output, i.e. y's shape.
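
One way to make those shapes and the product concrete (not in the original example, just a sketch) is to build the full Jacobian dy/dx with torch.autograd.functional.jacobian, which here is shaped (4, 4, 4), and contract it with grad_outputs over the two output dimensions; the result matches what torch.autograd.grad returns:

import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)

def f(x):
    return torch.outer(x, x)

y = f(x)
v = torch.ones_like(y)                                 # grad_outputs, i.e. dL/dy

# Full Jacobian dy/dx: entry [j, k, i] is dy_jk/dx_i
J = torch.autograd.functional.jacobian(f, x)
print(J.shape)                                         # torch.Size([4, 4, 4])

# "Vector"-Jacobian product: dL/dx_i = sum over j, k of v_jk * dy_jk/dx_i
vjp_manual = torch.einsum('jk,jki->i', v, J)

vjp_autograd, = torch.autograd.grad(y, x, grad_outputs=v)
print(torch.allclose(vjp_manual, vjp_autograd))        # True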

We want to compute dL/dx = dL/dy*dy/dx. If we look more closely, we have

dy/dx = [dy_jk/dx_i]_ijk for i, j, k ∈ [1, n]³

Therefore,

dL/dx = [dL/dx_i]_i, i ∈ [1, n]
      = [sum(dL/dy_jk * dy_jk/dx_i over j, k ∈ [1, n]²)]_i, i ∈ [1, n]

Back to your example: grad_outputs is all ones, so dL/dy_jk = 1 for every j, k, which means that for a given i ∈ [1, n]: dL/dx_i = sum(dy_jk/dx_i) over j, k ∈ [1, n]². And dy_jk/dx_i = d(x_j*x_k)/dx_i equals x_k if i = j, x_j if i = k, and 2*x_i if i = j = k (because of the squared x_i). Since the matrix y is symmetric, the result comes down to dL/dx_i = 2*sum(x_j) over j ∈ [1, n], for every i.

This means dL/dx is the column vector [2*sum(x)]_i for i ∈ [1, n].

>>> 2*x.sum()*torch.ones_like(x)
tensor([20., 20., 20., 20.])
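
The same bookkeeping works for any grad_outputs, not just all ones: each entry v_jk weights the corresponding dy_jk/dx_i. A quick sketch with an arbitrary v of my own choosing, using dy_jk/dx_i = x_k if i = j and x_j if i = k, so that dL/dx = v @ x + v.T @ x:

import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
y = torch.outer(x, x)

v = torch.arange(16.).reshape(4, 4)            # arbitrary dL/dy (my choice)

# dL/dx_i = sum_k v_ik*x_k + sum_j v_ji*x_j, i.e. v @ x + v.T @ x
manual = v @ x.detach() + v.t() @ x.detach()

auto, = torch.autograd.grad(y, x, grad_outputs=v)
print(torch.allclose(manual, auto))            # True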

Stepping back, look at this other graph example, where an additional operation has been added after y:

  x   ------->  y = x*xT  -------->  z = y²

If you look at the backward pass on this graph, you have:

dL/dx <-------   dL/dy    <--------  dL/dz
        dy/dx              dz/dy 
  x   ------->  y = x*xT  -------->  z = y²

With dL/dx = dL/dy*dy/dx = dL/dz*dz/dy*dy/dx, which is in practice computed in two sequential steps: dL/dy = dL/dz*dz/dy, then dL/dx = dL/dy*dy/dx.
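
Here is a sketch of those two sequential steps, assuming z = y² as in the graph above and taking dL/dz to be all ones (my choice here): chaining two torch.autograd.grad calls through y gives the same dL/dx as going from z back to x directly.

import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
y = torch.outer(x, x)
z = y ** 2

dL_dz = torch.ones_like(z)                     # incoming gradient at z (assumed)

# Direct: dL/dx in one call, from z all the way back to x
direct, = torch.autograd.grad(z, x, grad_outputs=dL_dz, retain_graph=True)

# Two sequential steps: first dL/dy = dL/dz*dz/dy, then dL/dx = dL/dy*dy/dx
dL_dy, = torch.autograd.grad(z, y, grad_outputs=dL_dz, retain_graph=True)
dL_dx, = torch.autograd.grad(y, x, grad_outputs=dL_dy)

print(torch.allclose(direct, dL_dx))           # True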

Upvotes: 18
