Reputation: 1187
I was going through the official PyTorch tutorial, where it explains tensor gradients and Jacobian products as follows:
Instead of computing the Jacobian matrix itself, PyTorch allows you to compute the Jacobian product for a given input vector v=(v1…vm). This is achieved by calling backward with v as an argument:
inp = torch.eye(5, requires_grad=True)
out = (inp+1).pow(2)
out.backward(torch.ones_like(inp), retain_graph=True)
print("First call\n", inp.grad)
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nSecond call\n", inp.grad)
inp.grad.zero_()
out.backward(torch.ones_like(inp), retain_graph=True)
print("\nCall after zeroing gradients\n", inp.grad)
Output:
First call
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])

Second call
tensor([[8., 4., 4., 4., 4.],
        [4., 8., 4., 4., 4.],
        [4., 4., 8., 4., 4.],
        [4., 4., 4., 8., 4.],
        [4., 4., 4., 4., 8.]])

Call after zeroing gradients
tensor([[4., 2., 2., 2., 2.],
        [2., 4., 2., 2., 2.],
        [2., 2., 4., 2., 2.],
        [2., 2., 2., 4., 2.],
        [2., 2., 2., 2., 4.]])
Though I get what a Jacobian matrix is, I didn't get how this Jacobian product is calculated.
Here are the different tensors I printed out to get an understanding:
>>> out
tensor([[4., 1., 1., 1., 1.],
        [1., 4., 1., 1., 1.],
        [1., 1., 4., 1., 1.],
        [1., 1., 1., 4., 1.],
        [1., 1., 1., 1., 4.]], grad_fn=<PowBackward0>)
>>> torch.eye(5)
tensor([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.]])
>>> torch.ones_like(inp)
tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])
>>> inp
tensor([[1., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.]], requires_grad=True)
But I didn't get how the tutorial's output is calculated. Can someone explain the Jacobian matrix and the calculations done in this example?
Upvotes: 1
Views: 1826
Reputation: 40618
We will go through the entire process: from computing the Jacobian to applying it to get the resulting gradient for this input. We're looking at the operation f(x) = (x + 1)². In the simple scalar setting, the derivative is df/dx = 2(x + 1).
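As a quick sanity check (my addition, not part of the tutorial), you can verify this scalar derivative with autograd on a single value; the numbers are just for illustration:
>>> import torch
>>> x = torch.tensor(3.0, requires_grad=True)
>>> y = (x + 1) ** 2      # f(x) = (x + 1)^2
>>> y.backward()          # scalar output, so no gradient argument is needed
>>> x.grad                # equals 2 * (3 + 1)
tensor(8.)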
In the multi-dimensional setting, we have an input x_ij and an output y_mn, indexed by (i, j) and (m, n) respectively. The function mapping is defined as y_mn = (x_mn + 1)².
First, we should look at the Jacobian itself. It corresponds to the tensor J containing all partial derivatives J_ijmn = dy_mn/dx_ij. From the expression of y_mn we can say that for all i, j, m, and n: dy_mn/dx_ij = d(x_mn + 1)²/dx_ij, which is 0 if m≠i or n≠j. Otherwise, i.e. when m=i and n=j, we have d(x_mn + 1)²/dx_ij = d(x_ij + 1)²/dx_ij = 2(x_ij + 1).
As a result, J_ijmn can be simply defined as
         ↱ 2(x_ij + 1)   if i=m, j=n
J_ijmn =
         ↳ 0             else
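To make that structure concrete, here is a small sketch (my own addition, not from the original answer) that builds this piecewise-defined J by hand for a 2×2 input and compares it against torch.autograd.functional.jacobian; the variable names are illustrative:
import torch
from torch.autograd.functional import jacobian

x = torch.eye(2)

# Build J by hand from the piecewise definition:
# J[i, j, m, n] = 2 * (x[i, j] + 1) if (i, j) == (m, n), else 0
J_manual = torch.zeros(2, 2, 2, 2)
for i in range(2):
    for j in range(2):
        J_manual[i, j, i, j] = 2 * (x[i, j] + 1)

# autograd's jacobian of f(x) = (x + 1)^2 has the same entries here:
# its native index order is (m, n, i, j), but the non-zero pattern only
# lives on i=m, j=n, so the two tensors coincide for this f
J_autograd = jacobian(lambda t: (t + 1) ** 2, x)
print(torch.allclose(J_manual, J_autograd))  # True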
From the chain rule, the gradient of the output with respect to the input x is dL/dx = dL/dy*dy/dx. From a PyTorch perspective we have the following relationships:

- x.grad = dL/dx, shaped like x
- dL/dy is the incoming gradient: the gradient argument in the backward function
- dy/dx is the Jacobian tensor described above

As explained in the documentation, applying backward doesn't actually provide the Jacobian. It computes the chain rule product directly and stores the gradient (i.e. dL/dx) inside x.grad.
In terms of shapes, the Jacobian multiplication dL/dy*dy/dx = gradient*J reduces to a tensor of the same shape as x.
The operation performed is defined by: [dL/dx]_ij = ∑_mn([dL/dy]_mn * J_ijmn).
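As a side note (my addition): if you want that contraction computed for you without touching .grad, torch.autograd.functional.vjp computes exactly this vector-Jacobian product. Here it is on the 2×2 identity input used just below, with v playing the role of dL/dy:
import torch
from torch.autograd.functional import vjp

f = lambda x: (x + 1) ** 2
inp = torch.eye(2)
v = torch.ones_like(inp)   # plays the role of dL/dy

# vjp returns (f(inp), the contraction sum_mn v_mn * J_ijmn)
out, grad = vjp(f, inp, v)
print(grad)
# tensor([[4., 2.],
#         [2., 4.]])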
Applying this to your example: we have x_ij = 1(i=j) (where 1(k): (k == True) -> 1 is the indicator function), i.e. essentially just the identity matrix.
We compute the Jacobian:
         ↱ 2(1(i=j) + 1)   if i=m, j=n
J_ijmn =
         ↳ 0               else
which becomes
         ↱ 2(1 + 1) = 4   if i=j=m=n
J_ijmn = → 2(0 + 1) = 2   if i=m, j=n, i≠j
         ↳ 0              else
For visualization purposes, we will stick with x = torch.eye(2):
>>> f = lambda x: (x+1)**2
>>> inp = torch.eye(2, requires_grad=True)
>>> J = A.jacobian(f, inp)
>>> J
tensor([[[[4., 0.],
          [0., 0.]],
         [[0., 2.],
          [0., 0.]]],
        [[[0., 0.],
          [2., 0.]],
         [[0., 0.],
          [0., 4.]]]])
Then we compute this contraction using torch.einsum (I won't go into details here; look through this, then this, for an in-depth overview of the einsum summation operator):
>>> torch.einsum('mn,ijmn->ij', torch.ones_like(inp), J)
tensor([[4., 2.],
        [2., 4.]])
This matches what you get when backpropagating from out with torch.ones_like(inp) as the incoming gradient:
>>> out = f(inp)
>>> out.backward(torch.ones_like(inp))
>>> inp.grad
tensor([[4., 2.],
        [2., 4.]])
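As an aside (again my addition): torch.autograd.grad computes the same vector-Jacobian product but returns it directly instead of accumulating into inp.grad, which makes this kind of experiment easier to repeat:
>>> g, = torch.autograd.grad(f(inp), inp, grad_outputs=torch.ones_like(inp))
>>> g
tensor([[4., 2.],
        [2., 4.]])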
If you backpropagate twice (while retaining the graph, of course) you end up computing the same operation, which accumulates on the parameter's grad attribute. So, naturally, after two backward passes you have twice the gradient:
>>> inp.grad.zero_()   # reset the gradient accumulated by the previous snippet
tensor([[0., 0.],
        [0., 0.]])
>>> out = f(inp)
>>> out.backward(torch.ones_like(inp), retain_graph=True)
>>> out.backward(torch.ones_like(inp))
>>> inp.grad
tensor([[8., 4.],
        [4., 8.]])
Those gradients accumulate; you can reset them by calling the in-place function zero_: inp.grad.zero_(). From there, if you backpropagate again, you will end up with a single accumulated gradient only.
In practice, you would register your parameters with an optimizer, on which you can call zero_grad to reset all parameters in that collection in one go.
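A minimal sketch of that workflow (my addition; the parameter name and learning rate are arbitrary):
import torch

w = torch.eye(2, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)   # register the parameter on an optimizer

loss = ((w + 1) ** 2).sum()
loss.backward()                      # populates w.grad

opt.zero_grad(set_to_none=False)     # resets .grad for every registered parameter
print(w.grad)                        # all zeros again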
Note: in the Jacobian snippet above, I have imported torch.autograd.functional as A.
Upvotes: 2