Janusz Chudzynski

Reputation: 2710

Gradient descent math implementation explanation needed.

I know the solution but I don't understand how the following equation was translated to code.

[Image: the logistic regression cost function J(theta) posted in the question]

Solution

grad  = (1/m) * ((sigmoid(X * theta)-y)' * X);

Upvotes: 2

Views: 469

Answers (2)

cangrejo

Reputation: 2202

As has already been said, the mathematical expression you posted is the cost function, whereas the code snippet you show is its gradient.

However, the summation is not missing. Let's break it down.

The gradient of the cost function with respect to the j-th parameter is

d/d(theta_j) J(theta) = (1/m) * sum_{i=1}^{m} ( sigmoid(x^(i) * theta) - y^(i) ) * x_j^(i)

With X * theta you get a vector that contains the dot product of all your data points and your parameter vector.

With sigmoid(X * theta) you evaluate the sigmoid of each of those dot products.

With sigmoid(X * theta) - y you get a vector containing the differences between all your predictions and the actual labels.

With (sigmoid(X * theta) - y)' * X you are transposing that vector of differences and computing its dot product with each of the columns of your data set (i.e. each of the x_j's across all data points).

Think about it for a second, and you'll see how that's exactly the summation in the expression, but evaluated for all the entries of your parameter vector, not just j.
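To see this concretely, here is a minimal Octave/MATLAB sketch (assuming, as in the question, that X is m-by-n with one example per row, y is m-by-1, theta is n-by-1, and sigmoid is already defined) showing that the explicit sum over examples and the vectorized one-liner produce the same gradient:

m = size(X, 1);                          % number of training examples
n = size(X, 2);                          % number of parameters

% Loop form: accumulate the summation over the m examples
grad_loop = zeros(1, n);
for i = 1:m
  h_i = sigmoid(X(i, :) * theta);        % prediction for example i
  grad_loop = grad_loop + (h_i - y(i)) * X(i, :);
end
grad_loop = (1/m) * grad_loop;

% Vectorized form from the question: the matrix product performs that same sum
grad_vec = (1/m) * ((sigmoid(X * theta) - y)' * X);

% grad_loop and grad_vec agree up to floating-point rounding

The summation is carried out by the matrix product itself, which is why no explicit sum appears in the one-liner.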

Upvotes: 1

stackoverflowuser2010

Reputation: 40889

The original line J(theta) represents the cost function for logistic regression.

The code that you showed, grad = ..., is the gradient of J(theta) with respect to the parameters; that is, grad is an implementation of d/dtheta J(theta). The derivative is important because that is used in gradient descent to move the parameters toward their optimal values (to minimize the cost J(theta)).
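For context, here is a minimal sketch of how that grad is typically plugged into the gradient descent update, using the same X, y, theta, m, and sigmoid as in your code (alpha, the learning rate, and num_iters are illustrative values, not taken from your question):

alpha = 0.01;                            % learning rate (illustrative)
num_iters = 1000;                        % number of iterations (illustrative)
for iter = 1:num_iters
  grad = (1/m) * ((sigmoid(X * theta) - y)' * X);
  theta = theta - alpha * grad';         % step against the gradient to reduce J(theta)
end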

Below is the formula for the gradient, taken from the first link listed below. Note that J(theta) is the same as your formula above and h(x) represents the sigmoid function.

The total gradient over all training examples requires a summation over m. In your code for grad above, you are computing the gradient over one training example due to the omission of the summation; thus, your code is probably computing the gradient for stochastic gradient descent, not full gradient descent.

d/d(theta_j) J(theta) = (1/m) * sum_{i=1}^{m} ( h(x^(i)) - y^(i) ) * x_j^(i)

For more information, you can google for "logistic regression cost function derivative", which leads to these links:

  1. This one in particular has everything you need: http://feature-space.com/2011/10/28/logistic-cost-function-derivative/

  2. These are apparently some lecture notes from Andrew Ng's class on machine learning and logistic regression with gradient descent: http://www.holehouse.org/mlclass/06_Logistic_Regression.html

  3. Explanation of how to compute the derivative step-by-step: https://math.stackexchange.com/questions/477207/derivative-of-cost-function-for-logistic-regression

Upvotes: 1
