Ant

Reputation: 1143

Is this neural net example I'm looking at a mistake, or am I not understanding backprop?

Is this model using one ReLU in two places, or are gradients computed by doing a matrix multiplication of the layers on both sides of a given layer?

In the last layer of this simple neural net (below), during backprop the gradient for the last layer w2 is calculated by a matrix multiplication of (y_pred - y) and h_relu, which I thought only happened between layers w1 and w2, not between w2 and y_pred.

The line in question is near the bottom: grad_w2 = h_relu.t().mm(grad_y_pred).
I am confused because I thought everything was supposed to go in order forwards and then in order backwards. Is this ReLU being used in two places?

Here is an attempt at a visual illustration of the model.

[diagram: x → w1 → ReLU → w2 → y_pred]

This example is from the PyTorch website; it is the second block of code on the page.

grad_w2 = h_relu.t().mm(grad_y_pred)


import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

I appreciate your patience in looking at this and trying to clear it up for me.
If you could add another layer of weights in the middle with another ReLU, that might help me understand. This is what I was trying to do; a rough sketch of my attempt is below.
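
Something like this is what I had in mind (a rough sketch only; the names w3, h2, h2_relu and the smaller learning rate are just my guesses):

    import torch

    dtype = torch.float
    device = torch.device("cpu")

    # N is batch size; D_in is input dimension;
    # H1, H2 are the two hidden dimensions; D_out is output dimension.
    N, D_in, H1, H2, D_out = 64, 1000, 100, 50, 10

    x = torch.randn(N, D_in, device=device, dtype=dtype)
    y = torch.randn(N, D_out, device=device, dtype=dtype)

    w1 = torch.randn(D_in, H1, device=device, dtype=dtype)
    w2 = torch.randn(H1, H2, device=device, dtype=dtype)
    w3 = torch.randn(H2, D_out, device=device, dtype=dtype)

    # Guess: smaller than the tutorial's 1e-6, since the extra random layer
    # makes the activations (and therefore the gradients) much larger.
    learning_rate = 1e-9
    for t in range(500):
        # Forward pass, strictly in order
        h1 = x.mm(w1)
        h1_relu = h1.clamp(min=0)
        h2 = h1_relu.mm(w2)
        h2_relu = h2.clamp(min=0)
        y_pred = h2_relu.mm(w3)

        loss = (y_pred - y).pow(2).sum().item()
        if t % 100 == 99:
            print(t, loss)

        # Backward pass, strictly in reverse order
        grad_y_pred = 2.0 * (y_pred - y)        # dL/dy_pred
        grad_w3 = h2_relu.t().mm(grad_y_pred)   # dL/dw3
        grad_h2_relu = grad_y_pred.mm(w3.t())   # dL/dh2_relu
        grad_h2 = grad_h2_relu.clone()
        grad_h2[h2 < 0] = 0                     # ReLU derivative
        grad_w2 = h1_relu.t().mm(grad_h2)       # dL/dw2
        grad_h1_relu = grad_h2.mm(w2.t())       # dL/dh1_relu
        grad_h1 = grad_h1_relu.clone()
        grad_h1[h1 < 0] = 0                     # ReLU derivative
        grad_w1 = x.t().mm(grad_h1)             # dL/dw1

        # Update weights using gradient descent
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
        w3 -= learning_rate * grad_w3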

Upvotes: 1

Views: 157

Answers (1)

jodag

Reputation: 22184

Consider the following diagram, which represents the network in question. The concept of back-propagation is simply a way to quickly and intuitively apply the chain rule to a complex sequence of operations in order to compute the gradient of an output w.r.t. a tensor. Usually we are interested in computing the gradients of leaf tensors (tensors which are not derived from other tensors) with respect to a loss or objective. All the leaf tensors are represented as circles in the following diagram, and the loss is represented by the rectangle with the L label.

[diagram: forward/backward graph of the network, with the leaf tensors x, w1, w2 drawn as circles and the loss L as a rectangle]

Using the backward diagram, we can follow the path from L to w1 and w2 to determine which partial derivatives we need in order to compute the gradient of L w.r.t. w1 and w2. For simplicity we will assume that all the leaf tensors are scalars, so as to avoid getting into the complexities of multiplying vectors and matrices.

Using this approach, the gradients of L w.r.t. w1 and w2 are

dL/dw1 = (dL/dy_pred) * (dy_pred/dh_relu) * (dh_relu/dh) * (dh/dw1)

and

dL/dw2 = (dL/dy_pred) * (dy_pred/dw2)

Something to notice is that since w2 is a leaf tensor which only enters the final operation y_pred = h_relu.mm(w2), computing dL/dw2 needs only dL/dy_pred (grad_y_pred in the code) and the local partial dy_pred/dw2, which is h_relu; the ReLU derivative does not appear on this path. That is why grad_w2 = h_relu.t().mm(grad_y_pred): h_relu shows up there as the input to the last layer, not because the ReLU is applied a second time. The ReLU derivative only appears on the longer path from L to w1.
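
As a quick sanity check (not part of the tutorial; the seed and shapes here are arbitrary), we can compute the same manual gradients as the question's code and compare them against what autograd produces for w1 and w2. They should match, which confirms that the ReLU is not being applied a second time in grad_w2:

    import torch

    torch.manual_seed(0)
    N, D_in, H, D_out = 64, 1000, 100, 10

    x = torch.randn(N, D_in)
    y = torch.randn(N, D_out)
    w1 = torch.randn(D_in, H, requires_grad=True)
    w2 = torch.randn(H, D_out, requires_grad=True)

    # Forward pass and loss; autograd fills in w1.grad and w2.grad
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    loss = (y_pred - y).pow(2).sum()
    loss.backward()

    # Manual gradients, exactly as in the question's code
    with torch.no_grad():
        grad_y_pred = 2.0 * (y_pred - y)      # dL/dy_pred
        grad_w2 = h_relu.t().mm(grad_y_pred)  # dL/dw2
        grad_h_relu = grad_y_pred.mm(w2.t())  # dL/dh_relu
        grad_h = grad_h_relu.clone()
        grad_h[h < 0] = 0                     # ReLU derivative: zero where h was negative
        grad_w1 = x.t().mm(grad_h)            # dL/dw1

    print(torch.allclose(w1.grad, grad_w1))   # expect: True
    print(torch.allclose(w2.grad, grad_w2))   # expect: True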

Upvotes: 1
