Fackelmann

Reputation: 43

Python Scikit - LinearRegression and Ridge return different results

I have a small data set with 47 samples. I'm running linear regression with 2 features.

After running LinearRegression I ran Ridge (with the sag solver and alpha=0). I would expect it to converge quickly and return exactly the same prediction as the one computed by solving the normal equations.

But every time I run Ridge I get a different result, close to the result provided by LinearRegression but not exactly the same. It doesn't matter how many iterations I run. Is this expected? Why? In the past I've implemented regular gradient descent myself and it quickly converges in this data set.

import sklearn.linear_model
from sklearn import preprocessing

ols = sklearn.linear_model.LinearRegression()
model = ols.fit(x_train, y_train)
print(model.predict([[1650, 3]]))
# [[ 293081.4643349]]

scaler = preprocessing.StandardScaler().fit(x_train)
x_scaled = scaler.transform(x_train)
ols = sklearn.linear_model.Ridge(alpha=0, solver="sag", max_iter=99999999)
model = ols.fit(x_scaled, y_train)
x_test = scaler.transform([[1650, 3]])
print(model.predict(x_test))
# [[ 293057.69986594]]

Upvotes: 1

Views: 2530

Answers (2)

Fackelmann

Reputation: 43

Thank you all for your answers! After reading @sascha's response I read a little more about Stochastic Average Gradient descent, and I think I've found the reason for this discrepancy: it does in fact seem to be due to the "stochastic" part of the algorithm.

Please check the Wikipedia page: https://en.wikipedia.org/wiki/Stochastic_gradient_descent

In regular gradient descent we update the weights on every iteration based on this formula:

    w := w - μ ∇Q(w) = w - (μ/n) Σᵢ ∇Qᵢ(w)

where the second term is the gradient of the full cost function Q multiplied by a learning rate μ.

This is repeated until convergence, and it always gives the same result after the same number of iterations, given the same starting weights.
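That determinism is easy to see in a minimal sketch (made-up toy data, not the data set from the question): with the same starting weights and the same data, full-batch gradient descent returns bit-for-bit identical results on every run.

```python
import numpy as np

# Toy data (made up for illustration): y = 3*x + 2, noise-free
rng = np.random.default_rng(0)
X = rng.normal(size=(47, 1))
y = 3.0 * X[:, 0] + 2.0
Xb = np.hstack([np.ones((47, 1)), X])  # prepend a bias column

def batch_gd(Xb, y, mu=0.1, n_iter=5000):
    # Full-batch update: w := w - mu * (1/n) * Xb.T @ (Xb @ w - y)
    w = np.zeros(Xb.shape[1])
    n = len(y)
    for _ in range(n_iter):
        w -= mu * Xb.T @ (Xb @ w - y) / n
    return w

w1 = batch_gd(Xb, y)
w2 = batch_gd(Xb, y)
print(w1)                      # converges to [2., 3.]
print(np.array_equal(w1, w2))  # True: same start, same data -> same result
```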

In Stochastic Gradient Descent, this is done instead on every iteration:

    w := w - μ ∇Qᵢ(w)

where the second term is the gradient of the cost at a single sample i (multiplied by the learning rate μ). All the samples are shuffled at the beginning, and then the algorithm cycles through them on each pass.

So I think a couple of things contribute to the behavior I asked about:

(EDITED: see the notes below)

  1. The point used to calculate the gradient at every iteration changes every time I re-run the fit function. That's why I don't obtain the same result every time.

(EDIT) (This can be made deterministic by setting random_state when constructing the Ridge estimator; it is a constructor parameter, not an argument to fit.)

  2. I also realized that the number of iterations the algorithm runs varies between 10 and 15 (regardless of the max_iter I set). I couldn't find the convergence criterion documented anywhere in scikit-learn, but my guess was that if I could tighten it (i.e. run more iterations) the answer would be much closer to the one from LinearRegression.

(EDIT) (The convergence criterion depends on tol, the precision of the solution. By tightening this parameter (I set it to 1e-100) I was able to obtain the same solution as the one reported by LinearRegression.)
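Putting both fixes together, here is a minimal sketch on made-up data (47 samples, 2 features, with invented coefficients standing in for the data set in the question): with random_state fixed and tol tightened, Ridge with the sag solver and alpha=0 reproduces the closed-form LinearRegression prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the 47-sample, 2-feature data set in the question
rng = np.random.default_rng(0)
x_train = np.column_stack([rng.uniform(800, 4500, 47), rng.integers(1, 6, 47)])
y_train = 140.0 * x_train[:, 0] + 9000.0 * x_train[:, 1] + rng.normal(0, 5000, 47)

# Closed-form solution via the normal equations
ols = LinearRegression().fit(x_train, y_train)

# Iterative solution: fixed seed -> deterministic shuffling; tight tol -> more epochs
scaler = StandardScaler().fit(x_train)
ridge = Ridge(alpha=0, solver="sag", tol=1e-10, max_iter=1_000_000,
              random_state=42)
ridge.fit(scaler.transform(x_train), y_train)

x_new = [[1650, 3]]
p_ols = ols.predict(x_new)
p_sag = ridge.predict(scaler.transform(x_new))
print(p_ols, p_sag)  # should agree to high precision once tol is tight enough
```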

Upvotes: 1

nsaura

Reputation: 331

The difference between your two outputs may come from the preprocessing that you apply only for the Ridge regression: scaler = preprocessing.StandardScaler().fit(x_train).

By normalizing like this you change the representation of your data, and that can lead to different results.

Note also that plain OLS minimizes only the squared differences between expected and predicted outputs, while Ridge additionally penalizes the L2 norm of the coefficients, so the two objectives differ whenever alpha > 0.

Upvotes: 0
