alecbz

Reputation: 6498

Can anyone look over some simple gradient descent code?

I'm trying to implement a very simple one-dimensional gradient descent algorithm. The code I have does not work at all. Depending on my alpha value, the final parameters are either wildly huge (around 70 digits) or essentially zero (~0.000). I feel like gradient descent should not be nearly this sensitive to alpha (I'm generating small data in [0.0, 1.0], but I'd think the gradient itself should account for the scale of the data, no?).

Here's the code:

#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <vector>

using namespace std;

double a, b;
double theta0 = 0.0, theta1 = 0.0;

double myrand() {
  return double(rand()) / RAND_MAX;
}

double f(double x) {
  double y = a * x + b;
  y *= 0.1 * (myrand() - 0.5);  // +/- 5% noise

  return y;
}

double h(double x) {
  return theta1 * x + theta0;
}

int main() {
  srand(time(NULL));
  a = myrand();
  b = myrand();

  printf("set parameters: a = %lf, b = %lf\n", a, b);

  int N = 100;

  vector<double> xs(N);
  vector<double> ys(N);
  for (int i = 0; i < N; ++i) {
    xs[i] = myrand();
    ys[i] = f(xs[i]);
  }

  double sensitivity = 0.008;
  double d0, d1;

  for (int n = 0; n < 100; ++n) {
    d0 = d1 = 0.0;
    for (int i = 0; i < N; ++i) {
      d0 += h(xs[i]) - ys[i];
      d1 += (h(xs[i]) - ys[i]) * xs[i];
    }

    theta0 -= sensitivity * d0;
    theta1 -= sensitivity * d1;

    printf("theta0: %lf, theta1: %lf\n", theta0, theta1);
  }

  return 0;
}

Upvotes: 1

Views: 3086

Answers (2)

Filippo Pensalfini

Reputation: 1714

I had a quick look at your implementation and it looks fine to me.

The code I have does not work at all.

I wouldn't say that. It seems to behave correctly for small enough values of sensitivity. That is a value you essentially have to guess, and that is how gradient descent is supposed to work.

I feel like gradient descent should not be nearly this sensitive to alpha

If you struggle to visualize that, remember that you are using gradient descent to minimize the cost function of linear regression, which is a quadratic function. If you plot the cost function, you will see why the learning rate is so sensitive in these cases: intuitively, if the parabola is narrow, the algorithm converges more quickly, which is good, but the learning rate is also more "sensitive" and the algorithm can easily diverge if you are not careful.

Upvotes: 0

pabaldonedo

Reputation: 957

Changing the value of alpha can cause the algorithm to diverge, so that may be one cause of what you are seeing. You can check by computing the error on each iteration and seeing whether it increases or decreases.

In addition, it is often recommended to initialize the values of theta randomly at the beginning instead of setting them to zero.

Apart from that, you should divide by N when you update the value of theta as follows:

theta0 -= sensitivity * d0/N;

theta1 -= sensitivity * d1/N;

Upvotes: 2
