mon

Reputation: 22244

Backpropagation (Coursera ML by Andrew Ng) gradient descent clarification

Question

Please forgive me for asking a Coursera ML course-specific question. I hope someone who has done the course can answer.

In the Coursera ML Week 4 Multi-class Classification and Neural Networks assignment, why is the weight (theta) gradient adding (plus) the derivative instead of subtracting it?

% Calculate the gradient of Theta2
% Derivative of the loss function J = L(Z):      dJ/dZ = (oi - yi) / (oi (1 - oi))
% Derivative of the sigmoid activation function: dZ/dY = oi (1 - oi)

delta_theta2 = oi - yi;  % <--- (dJ/dZ) * (dZ/dY)

% Using +/plus, NOT -/minus
Theta2_grad = Theta2_grad + ...              <-------- Why plus (+)?
              bsxfun(@times, hi, transpose(delta_theta2));

Code Excerpt

for i = 1:m  
    % i indexes the training examples. X(i, :) is a 401-element row (including the bias).
    xi = X(i, :);
    yi = Y(i, :);
    
    % hi is the hidden-layer output for example i. H(i, :) is a 26-element row.
    hi = H(i, :);
    
    % oi is the output-layer output for example i. O(i, :) is a 10-element row.
    oi = O(i, :);
    
    %------------------------------------------------------------------------
    % Calculate the gradients of Theta2
    %------------------------------------------------------------------------
    delta_theta2 = oi - yi;
    Theta2_grad = Theta2_grad + bsxfun(@times, hi, transpose(delta_theta2));
 
    %------------------------------------------------------------------------
    % Calculate the gradients of Theta1
    %------------------------------------------------------------------------
    % Derivative of g(z): g'(z)=g(z)(1-g(z)) where g(z) is sigmoid(H_NET).
    dgz = (hi .* (1 - hi));
    delta_theta1 = dgz .* sum(bsxfun(@times, Theta2, transpose(delta_theta2)));
    % There is no input into H0, hence there is no theta for H0. Remove H0.
    delta_theta1 = delta_theta1(2:end);
    Theta1_grad = Theta1_grad + bsxfun(@times, xi, transpose(delta_theta1));
end

I thought it should be subtracting the derivative.


Upvotes: 0

Views: 266

Answers (1)

ntlarry

Reputation: 250

The gradient is computed by averaging the per-example gradients over all training examples, so inside the loop we first "accumulate" them by summing the gradient of each example. The line you highlighted with the plus is therefore not the gradient descent update step (notice that the learning rate alpha does not appear there either). The update happens elsewhere, most likely outside of the loop from 1 to m.
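As a minimal sketch of what that later step typically looks like, assuming plain batch gradient descent with a placeholder learning rate alpha (the actual assignment hands the gradients to an optimizer such as fmincg, which applies the equivalent step internally):

% Hypothetical sketch (not from the assignment): after the accumulation loop,
% the sums are averaged and the minus sign appears only in the update step.
Theta1_grad = Theta1_grad / m;          % average over the m training examples
Theta2_grad = Theta2_grad / m;

alpha = 0.01;                           % placeholder learning rate
Theta1 = Theta1 - alpha * Theta1_grad;  % <--- the subtraction lives here
Theta2 = Theta2 - alpha * Theta2_grad;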

Also, I am not sure when you will learn about this (I'm sure it's somewhere in the course), but you could also vectorize the code :)
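For example, here is a rough vectorized sketch of the same accumulation loop, assuming the matrix shapes implied by your excerpt (X is m x 401, H is m x 26, O and Y are m x 10, Theta2 is 10 x 26); treat it as an illustration rather than the assignment's reference solution:

% Hypothetical vectorized version of the accumulation loop above.
delta2 = O - Y;                                % m x 10, all examples at once
Theta2_grad = transpose(delta2) * H;           % 10 x 26, replaces the Theta2 accumulation

delta1 = (delta2 * Theta2) .* (H .* (1 - H));  % m x 26
delta1 = delta1(:, 2:end);                     % drop the bias unit H0
Theta1_grad = transpose(delta1) * X;           % 25 x 401, replaces the Theta1 accumulation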

Upvotes: 1
