Coefficient paths for Ridge Regression in scikit-learn

Question

Starting with a pandas DataFrame, d_train (774 rows):

The idea is to follow the example here to investigate Ridge coefficient paths.

In that example, here are the variable types:

X, y, w = make_regression(n_samples=10, n_features=10, coef=True,
                          random_state=1, bias=3.5)
print X.shape, type(X), y.shape, type(y), w.shape, type(w)

>> (10, 10)  (10,)  (10,)

To avoid the issue mentioned in this Stack Overflow discussion, I convert everything to NumPy arrays:

predictors = ['p1', 'p2', 'p3', 'p4']
target = ['target_bins']
X = d_train[predictors].as_matrix()
### X = np.transpose(d_train[predictors].as_matrix())
y = d_train['target_bins'].as_matrix()
w = numpy.full((774,), 3, dtype=float)
print X.shape, type(X), y.shape, type(y), w.shape, type(w)
>> (774, 4)  y_shape: (774,)      w_shape: (774,)

I then ran (a) the exact code from the example and (b) the same code with the parameters fit_intercept = True, normalize = True added to the ridge call (my data is not scaled). Both produce the same error message:

my_ridge = Ridge()
coefs = []
errors = []
alphas = np.logspace(-6, 6, 200)

for a in alphas:
    my_ridge.set_params(alpha=a, fit_intercept = True, normalize = True)
    my_ridge.fit(X, y)
    coefs.append(my_ridge.coef_)
    errors.append(mean_squared_error(my_ridge.coef_, w))
>> ValueError: Found input variables with inconsistent numbers of samples: [4, 774]

As the commented-out line in the code indicates, I also tried the "same" code with a transposed X matrix, and I tried scaling the data before creating the X matrix. Both attempts gave the same error message.

Finally, I did the same thing using 'RidgeClassifier', and managed to get a different error message.

>> Found input variables with inconsistent numbers of samples: [1, 774]

Question: I have no idea what is going on here--can you please help?

Using python 2.7 on Canopy 1.7.4.3348 (64 bit) with scikit-learn 18.01-3 and pandas 0.19.2-2

Thank you.

Sandipan Dey · Accepted Answer

You need as many weights in w as there are features, since the model learns a single weight per feature. In your code, w has 774 entries (the number of rows in the training dataset), while my_ridge.coef_ has only 4. The call mean_squared_error(my_ridge.coef_, w) therefore compares arrays of length 4 and 774, which is what raises the error. Modify the code as follows (to have 4 weights instead) and everything will work:

w = np.full((4,), 3, dtype=float) # number of features = 4, namely p1, p2, p3, p4
print X.shape, type(X), y.shape, type(y), w.shape, type(w)
#(774L, 4L)  (774L,)  (4L,)
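
As a minimal sketch of why the length of w matters, the snippet below compares a coefficient vector of length 4 against reference vectors of length 774 and of length 4. It is written for Python 3, and the arrays are placeholders (zeros standing in for my_ridge.coef_), not the data from the question:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Placeholder for my_ridge.coef_: one entry per feature (4 features assumed).
coef = np.zeros(4)

w_bad = np.full((774,), 3.0)  # one entry per row: wrong length
w_ok = np.full((4,), 3.0)     # one entry per feature: matches coef

try:
    mean_squared_error(coef, w_bad)
    message = None
except ValueError as exc:
    # scikit-learn rejects inputs whose lengths differ
    message = str(exc)
print(message)

# With matching lengths the call succeeds.
mse_ok = mean_squared_error(coef, w_ok)
print(mse_ok)
```

The ValueError here comes from mean_squared_error rather than from fit, which is why transposing or rescaling X does not remove it.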

Now you can run the rest of the code from http://scikit-learn.org/stable/auto_examples/linear_model/plot_ridge_coeffs.html#sphx-glr-auto-examples-linear-model-plot-ridge-coeffs-py to see how the weights and the errors vary over the grid of values for the regularization parameter alpha, and to obtain the corresponding figures.
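
For reference, the coefficient-path loop can be sketched end to end as follows. This is an illustration under assumptions, not the code from the linked example: it targets Python 3, synthetic data from make_regression stands in for d_train (so w holds the true coefficients), and normalize is left out because recent scikit-learn releases no longer accept that parameter:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the question's data (assumption):
# 774 rows, 4 features, and the true coefficient vector w of length 4.
X, y, w = make_regression(n_samples=774, n_features=4, coef=True,
                          random_state=1, bias=3.5)

ridge = Ridge(fit_intercept=True)
alphas = np.logspace(-6, 6, 200)
coefs = []
errors = []

for a in alphas:
    ridge.set_params(alpha=a)
    ridge.fit(X, y)
    coefs.append(ridge.coef_)
    # Both arguments have length 4 (one entry per feature).
    errors.append(mean_squared_error(ridge.coef_, w))

coefs = np.array(coefs)    # one row of coefficients per alpha
errors = np.array(errors)  # one coefficient error per alpha
print(X.shape, y.shape, w.shape, coefs.shape, errors.shape)
```

Plotting coefs and errors against alphas on a logarithmic x-axis gives the two curves. With real data there is no known true w, so the error curve is only meaningful relative to whatever reference vector you choose.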
