Janusz Chudzynski

Reputation: 2710

Gradient descent math implementation explanation needed.

I know the solution but I don't understand how the following equation was translated to code.

[Image: the logistic regression cost function J(theta) posted in the question]

Solution

grad  = (1/m) * ((sigmoid(X * theta)-y)' * X);

Upvotes: 2

Views: 469

Answers (2)

cangrejo

Reputation: 2202

As has already been said, the mathematical expression you posted is the cost function, whereas the code snippet you show is its gradient.

However, the summation is not missing. Let's break it down.

The gradient of the cost function with respect to the j-th parameter is

d/d(theta_j) J(theta) = (1/m) * sum_{i=1}^{m} ( sigmoid(x^(i) * theta) - y^(i) ) * x_j^(i)

With X * theta you get a vector that contains the dot product of all your data points and your parameter vector.

With sigmoid(X * theta) you evaluate the sigmoid of each of those dot products.

With sigmoid(X * theta) - y you get a vector containing the differences between all your predictions and the actual labels.

With (sigmoid(X * theta) - y)' * X you are transposing that vector of differences and computing its dot product with each of the columns of your data set (i.e. each of the x_j's across all data points).

Think about it for a second, and you'll see how that's exactly the summation in the expression, but evaluated for all the entries of your parameter vector, not just j.
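To see this concretely, here is a minimal Octave/MATLAB sketch (assuming, as in the question, that X is m-by-n with one example per row, y is m-by-1, theta is n-by-1, and sigmoid is already defined) showing that the explicit sum over examples and the vectorized one-liner produce the same gradient:

m = size(X, 1);                          % number of training examples
n = size(X, 2);                          % number of parameters

% Loop form: accumulate the summation over the m examples
grad_loop = zeros(1, n);
for i = 1:m
  h_i = sigmoid(X(i, :) * theta);        % prediction for example i
  grad_loop = grad_loop + (h_i - y(i)) * X(i, :);
end
grad_loop = (1/m) * grad_loop;

% Vectorized form from the question: the matrix product performs that same sum
grad_vec = (1/m) * ((sigmoid(X * theta) - y)' * X);

% grad_loop and grad_vec agree up to floating-point rounding

The summation is carried out by the matrix product itself, which is why no explicit sum appears in the one-liner.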

Upvotes: 1

stackoverflowuser2010

Reputation: 40889

The original line J(theta) represents the cost function for logistic regression.

The code that you showed, grad = ..., is the gradient of J(theta) with respect to the parameters; that is, grad is an implementation of d/dtheta J(theta). The derivative is important because that is used in gradient descent to move the parameters toward their optimal values (to minimize the cost J(theta)).
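For context, here is a minimal sketch of how that grad is typically plugged into the gradient descent update, using the same X, y, theta, m, and sigmoid as in your code (alpha, the learning rate, and num_iters are illustrative values, not taken from your question):

alpha = 0.01;                            % learning rate (illustrative)
num_iters = 1000;                        % number of iterations (illustrative)
for iter = 1:num_iters
  grad = (1/m) * ((sigmoid(X * theta) - y)' * X);
  theta = theta - alpha * grad';         % step against the gradient to reduce J(theta)
end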

Below is the formula for the gradient, taken from the first link listed below. Note that J(theta) is the same as your formula above and h(x) represents the sigmoid function.

The total gradient over all training examples requires a summation over m. In your code for grad above, you are computing the gradient over one training example due to the omission of the summation; thus, your code is probably computing the gradient for stochastic gradient descent, not full gradient descent.

d/d(theta_j) J(theta) = (1/m) * sum_{i=1}^{m} ( h(x^(i)) - y^(i) ) * x_j^(i)

For more information, you can google for "logistic regression cost function derivative", which leads to these links:

  1. This one in particular has everything you need: http://feature-space.com/2011/10/28/logistic-cost-function-derivative/

  2. These are apparently some lecture notes from Andrew Ng's class on machine learning and logistic regression with gradient descent: http://www.holehouse.org/mlclass/06_Logistic_Regression.html

  3. Explanation of how to compute the derivative step-by-step: https://math.stackexchange.com/questions/477207/derivative-of-cost-function-for-logistic-regression

Upvotes: 1
