user2738815

Reputation: 1286

Coefficient paths for Ridge Regression in scikit-learn

Starting with a pandas DataFrame, d_train (774 rows):

[Image: preview of the d_train DataFrame]

The idea is to follow the example here to investigate Ridge coefficient paths.

In that example, here are the variable types:

X, y, w = make_regression(n_samples=10, n_features=10, coef=True,
                          random_state=1, bias=3.5)
print X.shape, type(X), y.shape, type(y), w.shape, type(w)

>> (10, 10) <type 'numpy.ndarray'> (10,) <type 'numpy.ndarray'> (10,) <type 'numpy.ndarray'>

To avoid the issue mentioned in this Stack Overflow discussion, I convert everything to numpy arrays:

import numpy as np

predictors = ['p1', 'p2', 'p3', 'p4']
target = ['target_bins']
X = d_train[predictors].as_matrix()
### X = np.transpose(d_train[predictors].as_matrix())
y = d_train['target_bins'].as_matrix()
w = np.full((774,), 3, dtype=float)
print X.shape, type(X), y.shape, type(y), w.shape, type(w)
>> (774, 4) <type 'numpy.ndarray'> (774,) <type 'numpy.ndarray'> (774,) <type 'numpy.ndarray'>

I then ran (a) the exact code from the example and (b) the same code with the parameters fit_intercept=True, normalize=True added to the Ridge call (my data is not scaled). Both versions give the same error message:

from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

my_ridge = Ridge()
coefs = []
errors = []
alphas = np.logspace(-6, 6, 200)

for a in alphas:
    my_ridge.set_params(alpha=a, fit_intercept = True, normalize = True)
    my_ridge.fit(X, y)
    coefs.append(my_ridge.coef_)
    errors.append(mean_squared_error(my_ridge.coef_, w))
>> ValueError: Found input variables with inconsistent numbers of samples: [4, 774]

As the commented-out line in the code indicates, I also tried the "same" code with a transposed X matrix. I also tried scaling the data before creating the X matrix. Both attempts gave the same error message.

Finally, I did the same thing using 'RidgeClassifier', and managed to get a different error message:

>> Found input variables with inconsistent numbers of samples: [1, 774]

Question: I have no idea what is going on here--can you please help?

Using python 2.7 on Canopy 1.7.4.3348 (64 bit) with scikit-learn 18.01-3 and pandas 0.19.2-2

Thank you.

Upvotes: 1

Views: 2177

Answers (1)

Sandipan Dey

Reputation: 23101

You need as many weights in w as there are features, since the model learns a single weight per feature. In your code the weight vector has 774 entries (the number of rows in the training dataset), while my_ridge.coef_ has only 4, so the call mean_squared_error(my_ridge.coef_, w) fails with the inconsistent-samples error. Change w to have 4 entries and everything will work:

w = np.full((4,), 3, dtype=float) # number of features = 4, namely p1, p2, p3, p4
print X.shape, type(X), y.shape, type(y), w.shape, type(w)
#(774L, 4L) <type 'numpy.ndarray'> (774L,) <type 'numpy.ndarray'> (4L,) <type 'numpy.ndarray'>
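As a rough check of the explanation above, the following Python 3 sketch reproduces the mismatch on synthetic data. The random `X` and `y` are hypothetical stand-ins for `d_train` (its real contents are not shown in the question), and the two-bin target is an assumption chosen because it would explain the `[1, 774]` message from `RidgeClassifier`.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeClassifier
from sklearn.metrics import mean_squared_error

# Hypothetical stand-in for d_train: 774 rows, 4 predictors, a two-bin target.
rng = np.random.RandomState(0)
X = rng.normal(size=(774, 4))
y = rng.randint(0, 2, size=774)

# Ridge learns one coefficient per feature, so coef_ has 4 entries.
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_.shape)  # (4,)

w_bad = np.full((774,), 3.0)  # one entry per row: wrong length
w_ok = np.full((4,), 3.0)     # one entry per feature: matches coef_

# fit() succeeds; it is the metric call that rejects the 4-vs-774 comparison.
raised = False
try:
    mean_squared_error(ridge.coef_, w_bad)
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# With matching lengths the metric returns a plain float.
mse_ok = mean_squared_error(ridge.coef_, w_ok)
print(mse_ok)

# For a binary target, RidgeClassifier stores coef_ as a 2-D array with one row,
# so the metric sees 1 "sample" on its side instead of 4.
clf = RidgeClassifier(alpha=1.0).fit(X, y)
print(clf.coef_.shape)  # (1, 4)
```

Note that `fit(X, y)` itself is fine with `X` of shape (774, 4) and `y` of shape (774,); transposing `X` is therefore not needed.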

Now you can run the rest of the code from http://scikit-learn.org/stable/auto_examples/linear_model/plot_ridge_coeffs.html#sphx-glr-auto-examples-linear-model-plot-ridge-coeffs-py to see how the weights and the errors vary as the regularization parameter alpha is swept over a grid of values, and obtain the following figures:

[Image: plots of the Ridge coefficients and the errors as a function of alpha]
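For readers who want to reproduce the sweep over alpha without the original data, here is a minimal Python 3 sketch. It assumes synthetic data from `make_regression` with the same shape as the question's data (774 samples, 4 features), so that the true coefficients `w` are known. The `normalize` argument is left out because, as far as I know, newer scikit-learn releases no longer accept it. Unscaled real data would need explicit scaling first, for example with `StandardScaler`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic stand-in with known true coefficients w (one per feature).
X, y, w = make_regression(n_samples=774, n_features=4, coef=True,
                          random_state=1, bias=3.5)

alphas = np.logspace(-6, 6, 200)
coefs = []
errors = []

for a in alphas:
    model = Ridge(alpha=a, fit_intercept=True).fit(X, y)
    coefs.append(model.coef_)
    # Compare the fitted coefficients with the true ones; both have length 4.
    errors.append(mean_squared_error(model.coef_, w))

coefs = np.array(coefs)    # one row per alpha, one column per feature
errors = np.array(errors)
print(coefs.shape)  # (200, 4)
```

Plotting each column of `coefs` against `alphas` on a logarithmic x-axis gives the coefficient paths, and plotting `errors` against `alphas` gives the coefficient error curve.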

Upvotes: 1
