
Reputation: 728

Why does simple gradient descent diverge?

This is my second attempt at implementing gradient descent in one variable and it always diverges. Any ideas?

This is simple linear regression for minimizing the residual sum of squares in one variable.

def gradient_descent_wtf(xvalues, yvalues):
    tolerance = 0.1

    #y=mx+b
    #some line to predict y values from x values
    m=1.
    b=1.

    #a predicted y-value has value mx + b

    for i in range(0,10):

        #calculate y-value predictions for all x-values
        predicted_yvalues = list()
        for x in xvalues:
            predicted_yvalues.append(m*x + b)

        # predicted_yvalues holds the predicted y-values

        #now calculate the residuals = y-value - predicted y-value for each point
        residuals = list()
        number_of_points = len(yvalues)
        for n in range(0,number_of_points):
            residuals.append(yvalues[n] - predicted_yvalues[n])

        ## calculate the residual sum of squares from the residuals, that is,
        ## square each residual and add them all up. we will try to minimize
        ## the residual sum of squares later.
        residual_sum_of_squares = 0.
        for r in residuals:
            residual_sum_of_squares += r**2
        print("RSS = %s" % residual_sum_of_squares)
        ##
        ##
        ##

        #now make a version of the residuals which is multiplied by the x-values
        residuals_times_xvalues = list()
        for n in range(0,number_of_points):
            residuals_times_xvalues.append(residuals[n] * xvalues[n])

        #now create the sums for the residuals and for the residuals times the x-values
        residuals_sum = sum(residuals)

        residuals_times_xvalues_sum = sum(residuals_times_xvalues)

        # now multiply the sums by a positive scalar and add each to m and b.

        residuals_sum *= 0.1
        residuals_times_xvalues_sum *= 0.1

        b += residuals_sum
        m += residuals_times_xvalues_sum

        #and repeat until convergence.
        #convergence occurs when ||sum vector|| < some tolerance.
        # ||sum vector|| = sqrt( residuals_sum**2 + residuals_times_xvalues_sum**2 )

        #check for convergence
        magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
        if magnitude_of_sum_vector < tolerance:
            break

    return (b, m)

Result:

gradient_descent_wtf([1,2,3,4,5,6,7,8,9,10],[6,23,8,56,3,24,234,76,59,567])
RSS = 370433.0
RSS = 300170125.7
RSS = 4.86943013045e+11
RSS = 7.90447409339e+14
RSS = 1.28312217794e+18
RSS = 2.08287421094e+21
RSS = 3.38110045417e+24
RSS = 5.48849288217e+27
RSS = 8.90939341376e+30
RSS = 1.44624932026e+34
Out[108]:
(-3.475524066284303e+16, -2.4195981188763203e+17)

Upvotes: 5

Views: 967

Answers (2)

Jing

Reputation: 1130

First, your code is correct.

But you should consider the math involved when you do linear regression.

For example, if the residual sum is -205.8 and your learning rate is 0.1, you take a huge descent step of -20.58.

A step that large overshoots, so you can never get back to the correct m and b. You have to make your step small enough.

There are two ways to keep the gradient descent step reasonable:

  1. Initialize a small learning rate, such as 0.001 or 0.0003.
  2. Divide your step by the number of input points, i.e. average the gradient.
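Combining both remedies, a minimal sketch (the function name and the choices of learning rate and iteration count here are illustrative, not from the question's code):

```python
def gradient_descent_averaged(xvalues, yvalues, learning_rate=0.001, iterations=100000):
    """Least-squares fit of y = m*x + b by gradient descent.

    The gradient sums are divided by the number of points, so the
    step size no longer grows with the size of the data set.
    """
    m, b = 1.0, 1.0
    n = len(xvalues)
    for _ in range(iterations):
        residuals = [y - (m * x + b) for x, y in zip(xvalues, yvalues)]
        # average the gradient components instead of using the raw sums
        b += learning_rate * sum(residuals) / n
        m += learning_rate * sum(r * x for r, x in zip(residuals, xvalues)) / n
    return (b, m)
```

With the question's data this converges toward the closed-form least-squares line instead of diverging.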

Upvotes: 1

John Coleman

Reputation: 51998

The gradients are huge -- hence you are following large vectors for long distances (0.1 times a large number is large). Find unit vectors in the appropriate direction. Something like this (with comprehensions replacing your loops):

def gradient_descent_wtf(xvalues, yvalues):
    tolerance = 0.1

    m=1.
    b=1.

    for i in range(0,10):
        predicted_yvalues = [m*x+b for x in xvalues]

        residuals = [y-y_hat for y,y_hat in zip(yvalues,predicted_yvalues)]

        residual_sum_of_squares = sum(r**2 for r in residuals) #only needed for debugging purposes
        print("RSS = %s" % residual_sum_of_squares)

        residuals_times_xvalues = [r*x for r,x in zip(residuals,xvalues)]

        residuals_sum = sum(residuals)

        residuals_times_xvalues_sum = sum(residuals_times_xvalues)

        # (residuals_sum, residuals_times_xvalues_sum) is a vector which points in the negative
        # gradient direction. *Find a unit vector which points in same direction*

        magnitude = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5

        residuals_sum /= magnitude
        residuals_times_xvalues_sum /= magnitude

        b += residuals_sum * (0.1)
        m += residuals_times_xvalues_sum * (0.1)

        #check for convergence -- this needs work!
        magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
        if magnitude_of_sum_vector < tolerance:
            break

    return (b, m)

For example:

>>> gradient_descent_wtf([1,2,3,4,5,6,7,8,9,10],[6,23,8,56,3,24,234,76,59,567])
RSS = 370433.0
RSS = 368732.1655050716
RSS = 367039.18363896786
RSS = 365354.0543519137
RSS = 363676.7775934381
RSS = 362007.3533123621
RSS = 360345.7814567845
RSS = 358692.061974069
RSS = 357046.1948108295
RSS = 355408.17991291644
(1.1157111313023558, 1.9932828425473605)

which is certainly much more plausible.

It isn't a trivial matter to make a numerically stable gradient-descent algorithm. You might want to consult a decent textbook in numerical analysis.
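One classical refinement along these lines (a sketch of my own, not part of the answer above): because the RSS is a quadratic in (m, b), the optimal step length along the gradient direction has a closed form, so it can be computed exactly instead of guessed via a learning rate:

```python
def gradient_descent_exact_step(xvalues, yvalues, iterations=1000, tolerance=1e-9):
    """Steepest descent on the RSS with an exact line search.

    Minimizing RSS(m + t*gm, b + t*gb) over t gives a closed-form
    step length, so no hand-tuned learning rate is needed.
    """
    m, b = 1.0, 1.0
    n = len(xvalues)
    sx = sum(xvalues)
    sxx = sum(x * x for x in xvalues)
    for _ in range(iterations):
        residuals = [y - (m * x + b) for x, y in zip(xvalues, yvalues)]
        gm = sum(r * x for r, x in zip(residuals, xvalues))  # descent direction, m component
        gb = sum(residuals)                                   # descent direction, b component
        gg = gm * gm + gb * gb
        if gg < tolerance:  # gradient vanishes -> converged
            break
        # exact minimizer of the quadratic RSS along (gm, gb)
        t = gg / (gm * gm * sxx + 2.0 * gm * gb * sx + gb * gb * n)
        m += t * gm
        b += t * gb
    return (b, m)
```

On the question's data this reaches the closed-form least-squares fit, m ≈ 37.55 and b ≈ −100.93, rather than stopping near the starting point as the fixed unit-step version above does.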

Upvotes: 2
