Reputation: 105
I am not able to use SelectKBest for feature selection in a Pipeline that ends with MultiOutputRegressor. In the code below, pipe1 works fine, but pipe2 raises the error shown further down. It seems SelectKBest cannot handle a y with multiple columns. Is this a known limitation?
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures

X = np.random.normal(0, 1, (100, 10))
y = np.random.normal(0, 1, (100, 2))   # two target columns

# Works: polynomial features, then one Ridge per target
pipe1 = Pipeline([('poly', PolynomialFeatures(2, include_bias=False)),
                  ('regr', MultiOutputRegressor(Ridge()))])
pipe1.fit(X, y)

# Fails: SelectKBest inserted before the multi-output regressor
pipe2 = Pipeline([('poly', PolynomialFeatures(2, include_bias=False)),
                  ('kbst', SelectKBest(f_regression, k=5)),
                  ('regr', MultiOutputRegressor(Ridge()))])
pipe2.fit(X, y)
Here is the error message:
ValueError
---> 17 pipe2.fit(X, y) [...]
/home/ecm/.conda/envs/mlpolar/lib/python3.7/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    845     raise ValueError(
    846         "y should be a 1d array, "
--> 847         "got an array of shape {} instead.".format(shape))

ValueError: y should be a 1d array, got an array of shape (100, 2) instead.
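For what it's worth, the same error appears when f_regression is called directly with the two-column y (a minimal check using the same X and y as above), so the scorer itself seems to require a single target column:

from sklearn.feature_selection import f_regression
# Raises ValueError: y should be a 1d array, got an array of shape (100, 2) instead.
f_regression(X, y)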
Upvotes: 1
Views: 592
Reputation: 105
Based on the post suggested in the comments, the fix is to move SelectKBest into an inner, per-target pipeline. MultiOutputRegressor then fits one copy of that inner pipeline per output, so SelectKBest (and f_regression) only ever sees a one-dimensional y:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures

X = np.random.normal(0, 1, (100, 10))
y = np.random.normal(0, 1, (100, 2))

# Inner pipeline: feature selection + regressor, fitted once per target
pipe_in = Pipeline([('kbst', SelectKBest(f_regression, k=5)),
                    ('regr', Ridge())])

# Outer pipeline: polynomial expansion, then one inner pipeline per output
pipe2 = Pipeline([('poly', PolynomialFeatures(2, include_bias=False)),
                  ('pipe', MultiOutputRegressor(pipe_in))])
pipe2.fit(X, y)
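Because MultiOutputRegressor clones pipe_in once per target, the selected features can differ between the two outputs. As a quick check after fitting, you can inspect which polynomial features were kept for each target via MultiOutputRegressor's estimators_ attribute and SelectKBest's get_support():

for i, est in enumerate(pipe2.named_steps['pipe'].estimators_):
    # est is the fitted copy of pipe_in for output column i
    mask = est.named_steps['kbst'].get_support()
    print(f"target {i}: selected feature indices {np.flatnonzero(mask)}")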
Upvotes: 2