Reputation: 81
I have a trained network, and I want to calculate the gradients of the outputs w.r.t. the inputs. According to the PyTorch docs, torch.autograd.grad may be useful. So I use the following code:
x_test = torch.randn(D_in,requires_grad=True)
y_test = model(x_test)
d = torch.autograd.grad(y_test, x_test)[0]
Here model is the neural network, x_test is the input of size D_in, and y_test is a scalar output.
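For reference, a self-contained version of this setup might look as follows; the two-layer model and D_in = 10 here are hypothetical stand-ins for the actual trained network:

```python
import torch

D_in = 10  # hypothetical input size; the real one is not given in the question
model = torch.nn.Sequential(  # stand-in for the trained network
    torch.nn.Linear(D_in, 5),
    torch.nn.Tanh(),
    torch.nn.Linear(5, 1),
)

x_test = torch.randn(D_in, requires_grad=True)
y_test = model(x_test)  # single-element output, treated as a scalar by autograd
d = torch.autograd.grad(y_test, x_test)[0]
print(d.shape)  # torch.Size([10]): one partial derivative per input component
```

Since y_test has a single element, autograd can use an implicit gradient of 1.0; for multi-element outputs, grad_outputs would have to be passed explicitly.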
I want to compare this result with the numerical derivative computed by scipy.misc.derivative, so I calculated the partial derivative with respect to a single index:
from scipy.misc import derivative
import torch

idx = 3
x_test = torch.randn(D_in, requires_grad=True)
y_test = model(x_test)
print(x_test[idx].item())
d = torch.autograd.grad(y_test, x_test)[0]
print(d[idx].item())

def fun(x):
    # clone so that writing to x_input does not modify x_test in place
    x_input = x_test.detach().clone()
    x_input[idx] = x
    with torch.no_grad():
        y = model(x_input)
    return y.item()

x0 = x_test[idx].item()
print(x0)
print(derivative(fun, x0, dx=1e-6))
But I got totally different results: the gradient calculated by torch.autograd.grad is -0.009522666223347187, while the one computed by scipy.misc.derivative is -0.014901161193847656.
Is there anything wrong with the calculation, or am I using torch.autograd.grad incorrectly?
Upvotes: 7
Views: 4291
Reputation: 11440
In fact, it is very likely that your code is completely correct. Let me explain this with a little background on backpropagation, or rather, in this case, automatic differentiation (AutoDiff).
The specific implementation of many packages is based on AutoGrad, a common technique for computing the exact derivatives of a function/graph. It does this by essentially "inverting" the forward computational pass: it computes the piece-wise derivatives of atomic function blocks, such as addition, subtraction, multiplication, and division, and then chains them together via the chain rule.
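To make "chaining piece-wise derivatives" concrete, here is a toy reverse-mode AutoDiff engine in plain Python. It is only an illustrative sketch of the principle, not how PyTorch is actually implemented:

```python
import math

class Value:
    """A scalar node in a computation graph that records, for each atomic
    operation, the local derivative w.r.t. each parent."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = list(parents)  # list of (parent_node, local_derivative)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data,
                     [(self, other.data), (other, self.data)])

    def sin(self):
        return Value(math.sin(self.data), [(self, math.cos(self.data))])

    def backward(self):
        # Topologically sort the graph, then sweep from the output back to
        # the leaves, accumulating chain-rule products along every path.
        topo, visited = [], set()

        def build(v):
            if id(v) not in visited:
                visited.add(id(v))
                for parent, _ in v._parents:
                    build(parent)
                topo.append(v)

        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            for parent, local in node._parents:
                parent.grad += node.grad * local

# y = x * sin(x)  =>  dy/dx = sin(x) + x * cos(x)
x = Value(1.5)
y = x * x.sin()
y.backward()
print(x.grad)  # matches sin(1.5) + 1.5 * cos(1.5) to machine precision
```

Every derivative here comes from an exact local rule (the derivative of sin is cos, the product rule for mul), so the result is exact up to floating-point rounding; no step size is involved.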
I explained AutoDiff and its specifics in a more detailed answer in this question.
In contrast, scipy's derivative function only approximates the derivative with finite differences: it evaluates the function at nearby points and divides the difference of the function values by the distance between those points. This can be an inaccurate representation of the actual derivative, which is why you see a slight difference between the two gradients. On top of the truncation error, with a step of dx=1e-6 and a model that computes in float32 (PyTorch's default dtype), the difference between the two function values is close to machine precision, so floating-point rounding alone can produce an error of the size you observe.
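To see how much the step size and floating-point precision matter, here is a small illustration that replaces the model with sin (whose exact derivative is known); central_diff is my own helper mimicking scipy.misc.derivative's default central-difference scheme:

```python
import numpy as np

def central_diff(f, x0, dx):
    # same scheme scipy.misc.derivative uses by default (order=3)
    return (float(f(x0 + dx)) - float(f(x0 - dx))) / (2 * dx)

exact = np.cos(1.0)  # d/dx sin(x) at x = 1

# single precision (PyTorch's default dtype) with the question's dx = 1e-6
fd32 = central_diff(lambda x: np.sin(np.float32(x)), 1.0, 1e-6)
# double precision with the same step size
fd64 = central_diff(lambda x: np.sin(np.float64(x)), 1.0, 1e-6)

print(abs(fd32 - exact))  # large (~1e-3 or worse): float32 rounding dominates
print(abs(fd64 - exact))  # tiny: truncation and rounding are both negligible
```

In float32, the two function values differ by only a handful of representable steps, so the quotient is heavily quantized; the resulting error is comparable to the discrepancy in the question, while the float64 version agrees with the exact derivative to many digits.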
Upvotes: 2