Reputation: 5722
I am having trouble understanding the conceptual meaning of the grad_outputs
option in torch.autograd.grad
.
The documentation says:
grad_outputs should be a sequence of length matching output containing the "vector" in Jacobian-vector product, usually the pre-computed gradients w.r.t. each of the outputs. If an output doesn't require_grad, then the gradient can be None.
I find this description quite cryptic. What exactly do they mean by Jacobian-vector product? I know what the Jacobian is, but not sure about what product they mean here: element-wise, matrix product, something else? I can't tell from my example below.
And why is "vector" in quotes? Indeed, in the example below I get an error when grad_outputs
is a vector, but not when it is a matrix.
>>> x = torch.tensor([1.,2.,3.,4.], requires_grad=True)
>>> y = torch.outer(x, x)
Why do we observe the following output, and how was it computed?
>>> y
tensor([[ 1.,  2.,  3.,  4.],
        [ 2.,  4.,  6.,  8.],
        [ 3.,  6.,  9., 12.],
        [ 4.,  8., 12., 16.]], grad_fn=<MulBackward0>)
>>> torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y))
(tensor([20., 20., 20., 20.]),)
However, why this error?
>>> torch.autograd.grad(y, x, grad_outputs=torch.ones_like(x))
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([4])
and output[0] has a shape of torch.Size([4, 4]).
Upvotes: 11
Views: 8330
Reputation: 40668
If we take your example, we have a function f which takes as input x of shape (n,) and outputs y = f(x) of shape (n, n). The input is the column vector [x_i]_i for i ∈ [1, n], and f(x) is defined as the matrix [y_jk]_jk = [x_j*x_k]_jk for j, k ∈ [1, n]².
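As a quick sketch of how that output matrix is built (this just restates the definition above, it is not specific to autograd): torch.outer broadcasts a column vector against a row vector.

import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)

# y_jk = x_j * x_k: torch.outer is equivalent to broadcasting (n, 1) * (1, n)
y = torch.outer(x, x)
y_manual = x.unsqueeze(1) * x.unsqueeze(0)

print(torch.equal(y, y_manual))  # True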
It is often useful to compute the gradient of the output with respect to the input (or sometimes w.r.t. the parameters of f; there are none here). In the more general case, though, we are looking to compute dL/dx and not just dy/dx, where dL/dx is the partial derivative of L, computed from y, w.r.t. x.
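A quick sketch of why grad_outputs exists at all: for a scalar output autograd can fill in the "vector" itself (it is just 1.), but for a non-scalar output such as y it has to be supplied.

import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
y = torch.outer(x, x)

# Scalar output: grad_outputs can be omitted.
g, = torch.autograd.grad(y.sum(), x, retain_graph=True)

# Non-scalar output: the "vector" dL/dy must be provided explicitly.
try:
    torch.autograd.grad(y, x)
except RuntimeError as e:
    print(e)  # grad can be implicitly created only for scalar outputs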
The computation graph looks like:
x.grad = dL/dx  <-------  dL/dy = y.grad
                 dy/dx
           x   ------->   y = x*xT
By the chain rule, dL/dx is equal to dL/dy*dy/dx. Looking at the interface of torch.autograd.grad, we have the following correspondences:

outputs <-> y
inputs <-> x
grad_outputs <-> dL/dy

Looking at the shapes: dL/dx should have the same shape as x (dL/dx can be referred to as the 'gradient' of x), while dy/dx, the Jacobian, would be 3-dimensional. On the other hand dL/dy, which is the incoming gradient, should have the same shape as the output, i.e. y's shape.
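To make the role of grad_outputs concrete, here is a small sketch (the scalar loss L and the weights w below are arbitrary choices for illustration): if L = (w * y).sum(), then dL/dy = w, and passing w as grad_outputs reproduces dL/dx.

import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
y = torch.outer(x, x)

w = torch.arange(16.).reshape(4, 4)   # arbitrary weights, so that dL/dy = w
L = (w * y).sum()

# Let autograd compute dL/dx from the scalar L...
dL_dx_auto, = torch.autograd.grad(L, x, retain_graph=True)

# ...which matches supplying dL/dy = w ourselves as grad_outputs:
dL_dx_vjp, = torch.autograd.grad(y, x, grad_outputs=w)

print(torch.allclose(dL_dx_auto, dL_dx_vjp))  # True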
We want to compute dL/dx = dL/dy*dy/dx. If we look more closely, we have

dy/dx = [dy_jk/dx_i]_ijk for i, j, k ∈ [1, n]³

Therefore,

dL/dx = [dL/dx_i]_i, i ∈ [1, n]
      = [sum(dL/dy_jk * dy_jk/dx_i over j, k ∈ [1, n]²)]_i, i ∈ [1, n]
Back to your example: since grad_outputs is torch.ones_like(y), we have dL/dy_jk = 1 for all j, k, so for a given i ∈ [1, n]: dL/dx_i = sum(dy_jk/dx_i over j, k ∈ [1, n]²). And dy_jk/dx_i = d(x_j*x_k)/dx_i will equal x_j if i = k, x_k if i = j, and 2*x_i if i = j = k (because of the squared x_i). Since matrix y is symmetric, the row-i and column-i contributions are equal, so the result comes down to dL/dx_i = 2*sum(x_j) over j ∈ [1, n]. This means dL/dx is the column vector [2*sum(x)]_i for i ∈ [1, n].
>>> 2*x.sum()*torch.ones_like(x)
tensor([20., 20., 20., 20.])
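The same accounting can be checked for an arbitrary grad_outputs by materializing the full Jacobian dy/dx with torch.autograd.functional.jacobian and contracting it explicitly; a minimal sketch:

import torch

x = torch.tensor([1., 2., 3., 4.], requires_grad=True)
f = lambda t: torch.outer(t, t)
y = f(x)

v = torch.rand_like(y)                        # an arbitrary dL/dy
J = torch.autograd.functional.jacobian(f, x)  # shape (n, n, n): J[j, k, i] = dy_jk/dx_i

# dL/dx_i = sum_jk dL/dy_jk * dy_jk/dx_i
dL_dx_manual = torch.einsum('jk,jki->i', v, J)
dL_dx_vjp, = torch.autograd.grad(y, x, grad_outputs=v)

print(torch.allclose(dL_dx_manual, dL_dx_vjp))  # True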
Stepping back, look at this other graph example, where an additional operation is added after y:
x -------> y = x*xT --------> z = y²
If you look at the backward pass on this graph, you have:
dL/dx  <-------  dL/dy  <--------  dL/dz
        dy/dx            dz/dy
  x  ------->  y = x*xT  -------->  z = y²
We have dL/dx = dL/dy*dy/dx = dL/dz*dz/dy*dy/dx, which in practice is computed in two sequential steps: first dL/dy = dL/dz*dz/dy, then dL/dx = dL/dy*dy/dx.
Upvotes: 18