Ridge regression using stochastic gradient descent in Python

I am trying to implement a solution to Ridge regression in Python using Stochastic gradient descent as the solver. My code for SGD is as follows:

def fit(self, X, Y):
    # Convert to data frame in case X is numpy matrix
    X = pd.DataFrame(X)

    # Define a function to calculate the error given a weight vector beta and a training example xi, yi

    # Prepend a column of 1s to the data for the intercept
    X.insert(0, 'intercept', np.array([1.0]*X.shape[0]))

    # Find dimensions of train
    m, d = X.shape

    # Initialize weights to random
    beta = self.initializeRandomWeights(d)
    beta_prev = None

    epochs = 0
    prev_error = None
    while (beta_prev is None or epochs < self.nb_epochs):
        print("## Epoch: " + str(epochs))
        indices = range(0, m)
        shuffle(indices)
        for i in indices:   # Pick a training example from a randomly shuffled set
            beta_prev = beta
            xi = X.iloc[i]
            errori = sum(beta*xi) - Y[i]    # Error[i] = sum(beta*x) - y = error of ith training example
            gradient_vector = xi*errori + self.l*beta_prev
            beta = beta_prev - self.alpha*gradient_vector
        epochs += 1

The data I'm testing this on is not normalized and my implementation always ends up with all the weights being Infinity, even though I initialize the weights vector to low values. Only when I set the learning rate alpha to a very small value ~1e-8, the algorithm ends up with valid values of the weights vector.

My understanding is that normalizing/scaling input features only helps reduce convergence time. But the algorithm should not fail to converge as a whole if the features are not normalized. Is my understanding correct?

Upvotes: 1

Answers (3)

Tomer Wolberg

Reputation: 1356

In order for sgd to converge in linear regression the step size should be smaller than 2/s where s is the largest singular value of the matrix (see the Convergence and stability in the mean section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter), in the case of ridge regression it should be less than 2*(1/s+pn/s^2) where p is the ridge penalty and n is the number of rows in the matrix because ridge regression is a type of spectral regularization for linear regression that changes each singular value v to v/(1+pn/v).

Normalizing the rows of the matrix (or gradients of the loss function) changes the loss function to give each sample an equal weight and it changes the singular values of the matrix such that you can choose a step size near 1 (see the NLMS section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter). Depending on your data it might require smaller step sizes or allow for larger step sizes. It all depends on whether or not the normalization increases or deacreses the largest singular value of the matrix.

But note that when deciding if you should normalize the rows or not you shouldn't just think about the convergence rate (which is determined by the ratio between the largest and smallest singular values) or stability in the mean, but also about how it changes the loss function and whether or not the new loss function fits your needs, sometimes it makes sense to normalize but sometimes (for example when you want to give different importance for different samples or when you think that a larger energy/norm for the signal means better signal to noise ratio) it doesn't make sense to normalize.

Upvotes: 0

Divyesh Peshavaria

Reputation: 599

You can check from scikit-learn's Stochastic Gradient Descent documentation that one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient based optimization algorithms converge faster on normalized data.

Also, normalization is advantageous for regression methods.

The updates to the coefficients during each step will depend on the ranges of each feature. Also, the regularization term will be affected heavily by large feature values.

SGD may converge without data normalization, but that is subjective to the data at hand. Therefore, your assumption is not correct.

Upvotes: 1

sascha

Reputation: 33542

Your assumption is not correct.

It's hard to answer this, because there are so many different methods/environments but i will try to mention some points.

Normalization

When some method is not scale-invariant (i think every linear-regression is not) you really should normalize your data
- I take it that you are just ignoring this because of debugging / analyzing
Normalizing your data is not only relevant for convergence-time, the results will differ too (think about the effect within the loss-function; big values might effect in much more loss to small ones)!

Convergence

There is probably much to tell about convergence of many methods on normalized/non-normalized data, but your case is special:
- SGD's convergence theory only guarantees convergence to some local-minimum (= global-minimum in your convex-opt problem) for some chosings of hyper-parameters (learning-rate and learning-schedule/decay)
- Even optimizing normalized data can fail with SGD when those params are bad!
  - This is one of the most important downsides of SGD; dependence on hyper-parameters
- As SGD is based on gradients and step-sizes, non-normalized data has a possibly huge effect on not achieving this convergence!

Upvotes: 0

Ridge regression using stochastic gradient descent in Python

Answers (3)

Normalization

Convergence

Related Questions