Reputation: 43
I have a small data set with 47 samples. I'm running linear regression with 2 features.
After running LinearRegression I ran Ridge with alpha=0 and the sag solver. I would expect it to converge quickly and return exactly the same prediction as the one computed by solving the normal equations.
But every time I run Ridge I get a slightly different result, close to the one LinearRegression gives but not exactly the same, and it doesn't matter how many iterations I allow. Is this expected? Why? In the past I've implemented plain gradient descent myself and it converges quickly on this data set.
import sklearn.linear_model
from sklearn import preprocessing

# Ordinary least squares (closed-form solution)
ols = sklearn.linear_model.LinearRegression()
model = ols.fit(x_train, y_train)
print(model.predict([[1650, 3]]))
# [[ 293081.4643349]]

# Ridge with no penalty (alpha=0), solved with SAG on standardized features
scaler = preprocessing.StandardScaler().fit(x_train)
x_scaled = scaler.transform(x_train)
ols = sklearn.linear_model.Ridge(alpha=0, solver="sag", max_iter=99999999)
model = ols.fit(x_scaled, y_train)
x_test = scaler.transform([[1650, 3]])
print(model.predict(x_test))
# [[ 293057.69986594]]
Upvotes: 1
Views: 2530
Reputation: 43
Thank you all for your answers! After reading @sascha's response I read a bit more about Stochastic Average Gradient descent, and I think I've found the reason for this discrepancy: it does indeed come from the "stochastic" part of the algorithm.
Please check the Wikipedia page: https://en.wikipedia.org/wiki/Stochastic_gradient_descent
In regular gradient descent the weights are updated on every iteration with:

w := w - mu * (1/n) * sum_i grad(Q_i(w))

where the second term is the gradient of the cost function over all n samples, multiplied by a learning rate mu.
This is repeated until convergence, and it always gives the same result after the same number of iterations, given the same starting weights.
In Stochastic Gradient Descent the update in every iteration is instead:

w := w - mu * grad(Q_i(w))

where the second term is the gradient at a single sample i, multiplied by the learning rate mu. All the samples are shuffled at the beginning, and the algorithm then cycles through them, one per update.
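To illustrate the difference, here is a minimal NumPy sketch of the two update rules (the toy data, learning rate and iteration counts are made up for illustration, not taken from my actual problem):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(47, 2))                    # toy stand-in for the 47x2 training set
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=47)
mu = 0.01                                       # learning rate

# Regular (batch) gradient descent: deterministic given the same starting weights.
w = np.zeros(2)
for _ in range(1000):
    grad = -2 * X.T @ (y - X @ w) / len(y)      # gradient of the mean squared error
    w -= mu * grad

# Stochastic gradient descent: one sample per update, visited in a random order,
# so the trajectory (and where it stops) depends on the shuffling.
w_sgd = np.zeros(2)
for _ in range(100):                            # passes over the data
    for i in rng.permutation(len(y)):
        grad_i = -2 * X[i] * (y[i] - X[i] @ w_sgd)
        w_sgd -= mu * grad_i

print(w, w_sgd)                                 # close to each other, but not identical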
So I think a couple of things contribute to the behavior I asked about:
(EDITED, see the replies below)
(EDIT) This can be made deterministic by setting the random_state parameter when constructing the Ridge estimator.
(EDIT) The convergence criterion depends on tol (the precision of the solution). By tightening this parameter (I set it to 1e-100) I was able to obtain the same solution as the one reported by LinearRegression. A sketch combining both fixes follows below.
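Here is a sketch of both fixes together (it assumes the x_train, y_train, scaler and x_scaled variables from the code in the question; the random_state value is arbitrary):

from sklearn.linear_model import Ridge

# Fixing random_state makes the sample shuffling reproducible, and the very small
# tol forces SAG to keep iterating until it essentially matches the closed-form solution.
ridge = Ridge(alpha=0, solver="sag", tol=1e-100, max_iter=99999999, random_state=42)
model = ridge.fit(x_scaled, y_train)
print(model.predict(scaler.transform([[1650, 3]])))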
Upvotes: 1
Reputation: 331
The difference between your two outputs may come from the preprocessing that you only apply for the Ridge regression: scaler = preprocessing.StandardScaler().fit(x_train).
By standardizing the data you change its representation, which can lead to different results.
Note also that OLS penalizes the L2 norm of the output differences (expected vs. predicted) only, while the Ridge algorithm additionally penalizes the L2 norm of the coefficients.
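A minimal sketch of the two objectives (the arrays are toy numbers, just for illustration):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])   # toy inputs
y = np.array([1.0, 2.0, 3.0])                        # toy targets
w = np.array([0.5, 0.1])                             # some candidate coefficients
alpha = 1.0

ols_loss = np.sum((y - X @ w) ** 2)              # OLS: squared output differences only
ridge_loss = ols_loss + alpha * np.sum(w ** 2)   # Ridge: adds the L2 penalty on the coefficients
print(ols_loss, ridge_loss)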
Upvotes: 0