What determines whether my Python gradient descent algorithm converges?

Question

I've implemented a single-variable linear regression model in Python that uses gradient descent to find the intercept and slope of the best-fit line (I'm using gradient descent rather than computing the optimal values for intercept and slope directly because I'd eventually like to generalize to multiple regression).

The data I am using are below. sales is the dependent variable (in dollars) and temp is the independent variable (degrees celsius) (think ice cream sales vs temperature, or something similar).

sales   temp
215     14.20
325     16.40
185     11.90
332     15.20
406     18.50
522     22.10
412     19.40
614     25.10
544     23.40
421     18.10
445     22.60
408     17.20

And this is my data after it has been normalized:

sales        temp 
0.06993007  0.174242424
0.326340326 0.340909091
0           0
0.342657343 0.25
0.515151515 0.5
0.785547786 0.772727273
0.529137529 0.568181818
1           1
0.836829837 0.871212121
0.55011655  0.46969697
0.606060606 0.810606061
0.51981352  0.401515152

My code for the algorithm:

import numpy as np
import pandas as pd
from scipy import stats

class SLRegression(object):
    def __init__(self, learnrate = .01, tolerance = .000000001, max_iter = 10000):

        # Initialize learnrate, tolerance, and max_iter.
        self.learnrate = learnrate
        self.tolerance = tolerance
        self.max_iter = max_iter

    # Define the gradient descent algorithm.
    def fit(self, data):
        # data   :   array-like, shape = [m_observations, 2_columns] 

        # Initialize local variables.
        converged = False
        m = data.shape[0]

        # Track number of iterations.
        self.iter_ = 0

        # Initialize theta0 and theta1.
        self.theta0_ = 0
        self.theta1_ = 0

        # Compute the cost function.
        J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
        print('J is: ', J)

        # Iterate over each point in data and update theta0 and theta1 on each pass.
        while not converged:
            diftemp0 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) for i in range(m)])
            diftemp1 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) * data[i][1] for i in range(m)])

            # Subtract the learnrate * partial derivative from theta0 and theta1.
            temp0 = self.theta0_ - (self.learnrate * diftemp0)
            temp1 = self.theta1_ - (self.learnrate * diftemp1)

            # Update theta0 and theta1.
            self.theta0_ = temp0
            self.theta1_ = temp1

            # Compute the updated cost function, given new theta0 and theta1.
            new_J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
            print('New J is: %s') % (new_J)

            # Test for convergence.
            if abs(J - new_J) <= self.tolerance:
                converged = True
                print('Model converged after %s iterations!') % (self.iter_)

            # Set old cost equal to new cost and update iter.
            J = new_J
            self.iter_ += 1

            # Test whether we have hit max_iter.
            if self.iter_ == self.max_iter:
                converged = True
                print('Maximum iterations have been reached!')

        return self

    def point_forecast(self, x):
        # Given feature value x, returns the regression's predicted value for y.
        return self.theta0_ + self.theta1_ * x


# Run the algorithm on a data set.
if __name__ == '__main__':
    # Load in the .csv file.
    data = np.squeeze(np.array(pd.read_csv('sales_normalized.csv')))

    # Create a regression model with the default learning rate, tolerance, and maximum number of iterations.
    slregression = SLRegression()

    # Call the fit function and pass in the data.
    slregression.fit(data)

    # Print out the results.
    print('After %s iterations, the model converged on Theta0 = %s and Theta1 = %s.') % (slregression.iter_, slregression.theta0_, slregression.theta1_)
    # Compare our model to scipy linregress model.
    slope, intercept, r_value, p_value, slope_std_error = stats.linregress(data[:,1], data[:,0])
    print('Scipy linear regression gives intercept: %s and slope = %s.') % (intercept, slope)

    # Test the model with a point forecast.
    print('As an example, our algorithm gives y = %s given x = .87.') % (slregression.point_forecast(.87)) # Should be about .83.
    print('The true y-value for x = .87 is about .8368.')

I'm having trouble understanding exactly what allows the algorithm to converge versus return values that are completely wrong. Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005. This brings changes in the cost function from iteration to iteration down to around 614, but I can't get it to go any lower.

Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why? Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values? For instance, if I were going to deliver this algorithm to a client so they could make predictions of their own (I'm not, but for the sake of argument..), wouldn't I want them to simply be able to plug in the un-normalized x-value?

All and all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time. Is this normal, or are there flaws in my algorithm that are contributing to this issue?

jedwards · Accepted Answer

Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005

That's kind of to be expected the way you have your algorithm set up.

The normalization of the data makes it so the y-intercept of the best fit is around 0.0. Otherwise, you could have a y-intercept thousands of units off of the starting guess, and you'd have to trek there before you ever really started the optimization part.

Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why?

No, absolutely not, but if you don't normalize, you should pick a starting point more intelligently (you're starting at (m,b) = (0,0)). Your learnrate may also be too small if you don't normalize your data, and same with your tolerance.

Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values?

Apply whatever transformation that you applied to the original data to get the normalized data to your new x-value. (The code for normalization is outside of what you have shown). If this test point fell within the (minx,maxx) range of your original data, once transformed, it should fall within 0 <= x <= 1. Once you have this normalized test point, plug it into your theta equation of a line (remember, your thetas are m,b of the y-intercept form of the equation of a line).

All and all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time.

For a well-formed problem, if you're in fact diverging it often means your step size is too large. Try lowering it.

If it's simply not converging before it hits the max iterations, that could be a few issues:

Your step size is too small,
Your tolerance is too small,
Your max iterations is too small,
Your starting point is poorly chosen

In your case, using the non normalized data results in your starting point of (0,0) being very far off (the (m,b) of the non-normalized data is around (-159, 30) while the (m,b) of your normalized data is (0.10,0.79)), so most if not all of your iterations are being used just getting to the area of interest.

The problem with this is that by increasing the step size to get to the area of interest faster also makes it less-likely to find convergence once it gets there.

To account for this, some gradient descent algorithms have dynamic step size (or learnrate) such that large steps are taken at the beginning, and smaller ones as it nears convergence.

It may also be helpful for you to keep a history of of the theta pairs throughout the algorithm, then plot them. You'll be able to see the difference immediately between using normalized and non-normalized input data.

What determines whether my Python gradient descent algorithm converges?

Answers (1)

Related Questions