DSH

Reputation: 1139

Forward vs reverse mode differentiation - Pytorch

In the first example of Learning PyTorch with Examples, the author demonstrates how to create a neural network with numpy. Their code is pasted below for convenience:

# from: https://pytorch.org/tutorials/beginner/pytorch_with_examples.html
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

What is confusing to me is why the gradients of w1 and w2 are computed with respect to the loss (the second-to-last block of the code above).

Normally the opposite computation happens: the gradient of the loss is computed with respect to the weights (see the autograd sketch after my question below).

So my question is: why is the derivative computation in the example above written in the reverse order compared to the usual back-propagation computation?
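For reference, here is a minimal sketch of what I mean, using PyTorch's autograd (the names and sizes mirror the numpy code above; this snippet is mine, not part of the tutorial):

import torch

# Same sizes as in the numpy example
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# requires_grad=True tells autograd to track these tensors
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

# Forward pass and scalar loss
y_pred = x.mm(w1).clamp(min=0).mm(w2)
loss = (y_pred - y).pow(2).sum()

# Reverse-mode differentiation: fills w1.grad and w2.grad with
# d(loss)/d(w1) and d(loss)/d(w2)
loss.backward()
print(w1.grad.shape, w2.grad.shape)  # shapes match w1 and w2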

Upvotes: 2

Views: 890

Answers (1)

jodag

Reputation: 22244

It seems to be a typo in the comment. They are actually computing the gradient of the loss w.r.t. w2 and w1.

Let's quickly derive the gradient of loss w.r.t. w2 just to be sure. By inspection of your code we have

loss = sum((y_pred - y)^2),   where   y_pred = h_relu · w2   and   h_relu = max(x · w1, 0)

Using the chain rule from calculus

∂loss/∂w2 = ∂loss/∂y_pred · ∂y_pred/∂w2.

Each term can be represented using the basic rules of matrix calculus. These turn out to be

∂loss/∂y_pred = 2 · (y_pred - y)

and

∂y_pred/∂w2 = h_relu   (for y_pred = h_relu · w2 this term enters the gradient as h_reluᵀ on the left).

Plugging these terms back into the initial equation we get

∂loss/∂w2 = h_reluᵀ · 2 · (y_pred - y).

This perfectly matches the expressions computed by

grad_y_pred = 2.0 * (y_pred - y)       # gradient of loss w.r.t. y_pred
grad_w2 = h_relu.T.dot(grad_y_pred)    # gradient of loss w.r.t. w2

in the back-propagation code you provided.
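If you want to convince yourself numerically, here is a quick sanity check (the tiny sizes and the loss_fn helper are made up just for this test) comparing the hand-derived grad_w2 with a central finite-difference estimate:

import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2   # tiny sizes, just for the check

x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn(w2_):
    # Same forward pass as in the question, as a function of w2
    h_relu = np.maximum(x.dot(w1), 0)
    y_pred = h_relu.dot(w2_)
    return np.square(y_pred - y).sum()

# Analytic gradient, exactly as in the question's backprop code
h_relu = np.maximum(x.dot(w1), 0)
grad_w2 = h_relu.T.dot(2.0 * (h_relu.dot(w2) - y))

# Central finite differences, one entry of w2 at a time
eps = 1e-6
num_grad = np.zeros_like(w2)
for i in range(w2.shape[0]):
    for j in range(w2.shape[1]):
        w2_plus, w2_minus = w2.copy(), w2.copy()
        w2_plus[i, j] += eps
        w2_minus[i, j] -= eps
        num_grad[i, j] = (loss_fn(w2_plus) - loss_fn(w2_minus)) / (2 * eps)

print(np.max(np.abs(grad_w2 - num_grad)))  # should be tiny (around 1e-8)

The two agree, which is exactly what "gradient of the loss w.r.t. w2" means.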

Upvotes: 2
