theideasmith

Reputation: 2925

What is the issue with this implementation of gradient descent?

I tried to implement linear regression with gradient descent, but my error diverges to infinity. I've read over my code and still can't figure out where I went wrong. I'm hoping someone can help me debug why this implementation of linear regression isn't working.

When N=100 there are no problems, but when N=1000 the error diverges to infinity.

import numpy as np

class Regression:
    def __init__(self, xs, ys, w, alpha):
        self.w = w
        self.xs = xs
        self.ys = ys
        self.a = alpha
        self.N = float(len(xs))

    def error(self, ys, yhat):
        # mean squared error
        return (1./self.N)*np.sum((ys-yhat)**2)

    def propagate(self):
        # one gradient descent step on w
        yhat = self.xs*self.w[0] + self.w[1]
        loss = yhat - self.ys

        r1 = (2./self.N)*np.sum(loss*self.xs)   # gradient w.r.t. the slope
        r2 = (2./self.N)*np.sum(loss)           # gradient w.r.t. the intercept

        self.w[0] -= self.a*r1
        self.w[1] -= self.a*r2


N = 600
xs = np.arange(0,N)
bias = np.random.sample(size=N)*10
ys = xs * 2. + 2. + bias
ws = np.array([0.,0.])

regressor = Regression(
    xs, ys, ws,
    0.00001)

for i in range(1000):
    regressor.propagate()
    # print the mean squared error after each step
    print(regressor.error(regressor.ys, regressor.xs*regressor.w[0] + regressor.w[1]))

Output:

...
2.71623180177e+286
5.27841816362e+286
1.02574818143e+287
1.99332318715e+287
3.87359919362e+287
7.52751526171e+287
1.46281231441e+288
2.84266426942e+288
5.52411274435e+288
1.07349369184e+289
2.0861064206e+289
4.05390365232e+289
7.87789858657e+289
1.5309018532e+290
2.97498179035e+290
5.78124367308e+290
1.12346161297e+291
2.18320843611e+291
4.24260074438e+291
8.2445912074e+291
1.6021607564e+292
3.11345829619e+292
6.05034327761e+292
1.17575539141e+293
2.28483026006e+293
4.4400811218e+293
8.62835227315e+293

Upvotes: 0

Views: 67

Answers (2)

Prune

Reputation: 77837

You've exceeded the convergence radius of your method. I put a print statement at the bottom of propagate to trace the effect:

    self.w = np.array(res).astype(float)
    print(self.error(self.ys, yhat), '\t', r1, '\t', r2, '\t', self.w)

As K. A. Buhr pointed out, r1 scales quadratically with N. Choose your learning rate according to the input; there is no single constant that gradient descent promises will work for every data set. Here's the output from the first 20 iterations with N=600, as in your code:

486826.997899   -482786.592791  -1211.52883528  [ 4.82786593  0.01211529]
946024.542374   673013.376697   1680.38708612   [-1.90226784 -0.00468858]
1838377.19732   -938192.956012  -2350.99664804  [ 7.47966172  0.01882138]
3572474.5816    1307858.19046   3268.82617841   [-5.59892018 -0.01386688]
6942323.62211   -1823178.2573   -4565.30975898  [ 12.63286239   0.03178622]
13490907.7204   2541543.91414   6355.61930844   [-12.78257675  -0.03176997]
26216686.5837   -3542958.75828  -8868.35584965  [ 22.64701083   0.05691359]
50946528.2176   4938949.44036   12354.1444796   [-26.74248357  -0.06662786]
99003709.9274   -6884985.98436  -17230.4097511  [ 42.10737627   0.10567624]
192392610.191   9597796.6223    24011.0009034   [-53.87058995  -0.13443377]
373874053.385   -13379504.31    -33480.2810842  [ 79.92445315   0.20036904]
726544597.0     18651274.1534   46663.6193386   [-106.58828839   -0.26626715]
1411884707.51   -26000217.8559  -65058.4461128  [ 153.41389017    0.38431731]
2743697288.89   36244780.0586   90684.1600127   [-209.03391041   -0.52252429]
5331791469.79   -50525887.4157  -126423.886221  [ 296.22496374    0.74171457]
10361201450.4   70434012.7562   176228.707876   [-408.11516382   -1.02057251]
20134788880.2   -98186304.1721  -245674.553107  [ 573.7478779     1.43617302]
39127675046.8   136873506.894   342466.322375   [-794.98719104   -1.9884902 ]
76036305324.8   -190804176.229  -477412.833248  [ 1113.05457125     2.78563813]
147760369643.0  265984517.38    665513.730619   [-1546.79060255    -3.86949918]

However, with alpha set to 1e-6 (instead of 1e-5), the first iterations are

14495.6359775   -13788.3126768  -211.542964687  [ 0.01378831  0.00021154]
14306.0982004   -13697.7438847  -210.177498646  [ 0.02748606  0.00042172]
14119.0422005   -13607.7699931  -208.821001646  [ 0.04109383  0.00063054]
13934.4354818   -13518.3870942  -207.473414775  [ 0.05461221  0.00083801]
13752.2459738   -13429.5913063  -206.134679506  [ 0.0680418   0.00104415]
13572.4420258   -13341.3787729  -204.804737697  [ 0.08138318  0.00124895]
13394.9924018   -13253.7456628  -203.483531589  [ 0.09463693  0.00145244]
13219.8662747   -13166.6881702  -202.171003801  [ 0.10780362  0.00165461]
13047.0332208   -13080.202514   -200.867097331  [ 0.12088382  0.00185548]
12876.4632151   -12994.2849383  -199.571755548  [ 0.13387811  0.00205505]
12708.1266257   -12908.9317115  -198.284922195  [ 0.14678704  0.00225333]

... and it continues to converge. BTW, 1000 iterations are not enough to reach proper convergence even at N=600; you might want to stop on an error tolerance (an epsilon) rather than on a fixed number of iterations.
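
For illustration, here is a minimal sketch (my own, not Prune's code) of what such an epsilon-based stop could look like, reusing the xs and ys arrays and the Regression class from the question; the names tol and max_iter are made up for this sketch:

import numpy as np

tol = 1e-9          # stop once the error improves by less than this amount
max_iter = 100000   # safety cap so a diverging run still terminates

regressor = Regression(xs, ys, np.array([0., 0.]), 0.000001)

prev_error = np.inf
for i in range(max_iter):
    regressor.propagate()
    yhat = regressor.xs * regressor.w[0] + regressor.w[1]
    err = regressor.error(regressor.ys, yhat)
    if abs(prev_error - err) < tol:   # error has stopped improving
        break
    prev_error = err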

Upvotes: 1

K. A. Buhr

Reputation: 50819

As you increase N, the gradient components r1 and r2 at the starting point w=[0,0] scale quadratically and linearly with N, respectively. For sufficiently large N, the initial step taken on w becomes larger than twice w's distance from the correct value, so the correction overshoots and actually increases the error. This positive feedback makes w oscillate around the correct value with ever-increasing amplitude instead of converging.

If you make alpha ten times smaller, you'll find that N=1000 will converge.
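
To make the scaling concrete, here is a small check of my own (not part of this answer) that evaluates r1 at the starting point w=[0,0] for a few values of N, using the same data-generation recipe as the question:

import numpy as np

# With ys roughly 2*xs, loss = -ys at w = [0, 0], so r1 = -(2/N)*sum(ys*xs) grows like -(4/3)*N**2.
for N in (100, 600, 1000):
    xs = np.arange(0, N)
    ys = xs * 2. + 2. + np.random.sample(size=N) * 10
    r1 = (2. / N) * np.sum((0. - ys) * xs)
    print(N, r1)

# At alpha = 1e-5 the first update moves w[0] by alpha * |r1|: roughly 0.14 for N=100 but
# roughly 13 for N=1000, far more than twice the distance to the true slope of 2, which
# triggers the overshoot described above.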

Upvotes: 3
