cccjjj

Reputation: 91

Understanding Ridge Linear Regression in scikit-learn

I'm trying to understand how Ridge regression is implemented in scikit-learn's Ridge.

Ridge regression has a closed-form solution for minimizing ||y - Xw||^2 + α * ||w||^2, which is (X'X + α * I)^{-1} X'y

The intercept and coef_ of the fitted model do not seem to be identical to the closed-form solution. Any ideas how exactly ridge regression is implemented in scikit-learn?

from sklearn import datasets
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt
import numpy as np

# prepare dataset
boston = datasets.load_boston()
X = boston.data
y = boston.target
# add the w_0 intercept where the corresponding x_0 = 1
Xp = np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)

alpha = 0.5
ridge = Ridge(fit_intercept=True, alpha=alpha)
ridge.fit(X, y)

# 1. intercept and coef of the fit model
print(np.array([ridge.intercept_] + list(ridge.coef_)))
# output:
# array([  3.34288615e+01,  -1.04941233e-01,   4.70136803e-02,
#          2.52527006e-03,   2.61395134e+00,  -1.34372897e+01,
#          3.83587282e+00,  -3.09303986e-03,  -1.41150803e+00,
#          2.95533512e-01,  -1.26816221e-02,  -9.05375752e-01,
#          9.61814775e-03,  -5.30553855e-01])

# 2. the closed form solution
print(np.linalg.inv(Xp.T.dot(Xp) + alpha * np.eye(Xp.shape[1])).dot(Xp.T).dot(y))
# output:
# array([  2.17772079e+01,  -1.00258044e-01,   4.76559911e-02,
#         -6.63573226e-04,   2.68040479e+00,  -9.55123875e+00,
#          4.55214996e+00,  -4.67446118e-03,  -1.25507957e+00,
#          2.52066137e-01,  -1.15766049e-02,  -7.26125030e-01,
#          1.14804636e-02,  -4.92130481e-01])

Upvotes: 0

Views: 3524

Answers (2)

Ami Tavory

Reputation: 76297

You are correct that the analytical solution is

(X'X + αI)^{-1} X'y,

but the question is what are X and y. There are actually two different interpretations:

  1. In your analytical calculation, you're actually using Xp where a column of 1s has been prepended to X (for the intercept), and using the original y. This is what you're feeding into the above equation.

  2. In sklearn, the interpretation is different. First, yn is formed by subtracting the mean of y (which supplies the intercept). Then the calculation is performed on X and yn.

It's clear why you thought your interpretation was correct, since in OLS there is no difference. When you add Ridge penalties, though, your interpretation also penalizes the coefficient of the first column, which doesn't make much sense.

If you do the following

alpha = 0.5
ridge = Ridge(fit_intercept=True, alpha=alpha)
ridge.fit(X, y - np.mean(y))
# 1. intercept and coef of the fit model
print(np.array([ridge.intercept_] + list(ridge.coef_)))


Xp = Xp - np.mean(Xp, axis=0)
# 2. the closed form solution
print(np.linalg.inv(Xp.T.dot(Xp) + alpha * np.eye(Xp.shape[1])).dot(Xp.T).dot(y))

then you'll see the same results.
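
For reference, the centering view described above can also be written out end to end without refitting. This is a minimal sketch (reusing X, y and alpha from the question, not scikit-learn's internal code): center X and y, solve the penalized system on the centered data, then recover the intercept from the means.

# Sketch only: reproduce Ridge(fit_intercept=True) by centering.
# Assumes X, y and alpha as defined in the question.
X_mean = X.mean(axis=0)
y_mean = y.mean()
Xc = X - X_mean                      # centered features
yc = y - y_mean                      # centered target
# closed form on centered data (no intercept column, so nothing extra is penalized)
w = np.linalg.solve(Xc.T.dot(Xc) + alpha * np.eye(Xc.shape[1]), Xc.T.dot(yc))
intercept = y_mean - X_mean.dot(w)
print(intercept)  # should match ridge.intercept_
print(w)          # should match ridge.coef_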

Upvotes: 4

lejlot

Reputation: 66775

The tricky bit is the intercept. The closed-form solution you have is for the case without an intercept: when you append a column of 1s to your data, you also put the L2 penalty on the intercept term. Scikit-learn's ridge regression does not penalize the intercept.

If you do want the L2 penalty on the bias, simply call Ridge on Xp (and turn off fitting the bias in the constructor) and you get:

>>> ridge = Ridge(fit_intercept=False, alpha=alpha)
>>> ridge.fit(Xp, y)
>>> print(np.array(list(ridge.coef_)))
[  2.17772079e+01  -1.00258044e-01   4.76559911e-02  -6.63573226e-04
   2.68040479e+00  -9.55123875e+00   4.55214996e+00  -4.67446118e-03
  -1.25507957e+00   2.52066137e-01  -1.15766049e-02  -7.26125030e-01
   1.14804636e-02  -4.92130481e-01]
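
Conversely, you can reproduce the fit_intercept=True result from the closed form on Xp by simply leaving the intercept column out of the penalty. A sketch of that equivalent formulation (assuming Xp, y and alpha as defined in the question; this is not the library's internal code, which centers the data instead):

# Sketch: same normal equations, but with zero penalty on the intercept column.
D = np.eye(Xp.shape[1])
D[0, 0] = 0.0  # leave the bias/intercept term unpenalized
w = np.linalg.solve(Xp.T.dot(Xp) + alpha * D, Xp.T.dot(y))
print(w)  # first entry should match ridge.intercept_, the rest ridge.coef_
          # from Ridge(fit_intercept=True).fit(X, y)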

Upvotes: 4
