Eric Chassande-Mottin

Reputation: 105

Issue with SelectKBest when used with MultiOutputRegressor

I cannot use SelectKBest for feature selection in a Pipeline that ends with MultiOutputRegressor. In the code below, pipe1 works fine, but pipe2 raises the error shown after it. It seems SelectKBest cannot handle a y with multiple columns. Is this a known limitation?

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures

X = np.random.normal(0,1,(100,10))
y = np.random.normal(0,1,(100,2))

pipe1 = Pipeline([('poly', PolynomialFeatures(2, include_bias=False)),
                  ('regr', MultiOutputRegressor(Ridge()))])

pipe1.fit(X, y)

pipe2 = Pipeline([('poly', PolynomialFeatures(2, include_bias=False)),
                  ('kbst', SelectKBest(f_regression, k=5)),
                  ('regr', MultiOutputRegressor(Ridge()))])

pipe2.fit(X, y)

Here is the error message:

 ValueError
 ---> 17 pipe2.fit(X, y)
 [...]
 /home/ecm/.conda/envs/mlpolar/lib/python3.7/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
     845     raise ValueError(
     846         "y should be a 1d array, "
 --> 847         "got an array of shape {} instead.".format(shape))
     848 
     849 
 
 ValueError: y should be a 1d array, got an array of shape (100, 2) instead.
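
The traceback ends in column_or_1d, which suggests that the 2-D y is rejected by the scoring function rather than by the Pipeline itself. A minimal check, assuming the input validation inside f_regression is what raises here (the seeded generator and the variable names are illustrative, not from the original post):

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Illustrative data with the same shapes as in the question.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 10))
y = rng.normal(0, 1, (100, 2))

# Scoring against a single target column works: one F-statistic per feature.
f_stat, p_val = f_regression(X, y[:, 0])

# Scoring against both columns at once is expected to fail,
# because f_regression validates y as a 1-D array.
try:
    f_regression(X, y)
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```

If the second call raises while the first succeeds, the limitation lies in the univariate score function, which is what SelectKBest calls during fit.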

Upvotes: 1

Views: 592

Answers (1)

Eric Chassande-Mottin

Reputation: 105

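Based on the post suggested in the comments, the fix is to move SelectKBest inside the MultiOutputRegressor. MultiOutputRegressor fits one clone of its estimator per target column, so each SelectKBest then receives a 1-D y. Note that each output may therefore keep a different set of 5 features: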

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures

X = np.random.normal(0,1,(100,10))
y = np.random.normal(0,1,(100,2))

pipe_in = Pipeline([('kbst', SelectKBest(f_regression, k=5)),
                    ('regr', Ridge())])
pipe2 = Pipeline([('poly', PolynomialFeatures(2, include_bias=False)),
                  ('pipe', MultiOutputRegressor(pipe_in))])

pipe2.fit(X, y)
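
To see what the nested pipeline did, the fitted per-target estimators can be inspected. The sketch below assumes the same setup as the code above. The seeded generator is used only for reproducibility, and `multi`, `selected` and `pred` are illustrative names. It lists the feature indices that SelectKBest kept for each output column:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 10))
y = rng.normal(0, 1, (100, 2))

# Feature selection lives inside the per-target estimator.
pipe_in = Pipeline([('kbst', SelectKBest(f_regression, k=5)),
                    ('regr', Ridge())])
pipe2 = Pipeline([('poly', PolynomialFeatures(2, include_bias=False)),
                  ('pipe', MultiOutputRegressor(pipe_in))])
pipe2.fit(X, y)

# MultiOutputRegressor stores one fitted clone of pipe_in per target column.
multi = pipe2.named_steps['pipe']
selected = [est.named_steps['kbst'].get_support(indices=True)
            for est in multi.estimators_]

for i, idx in enumerate(selected):
    print(f"target {i}: selected feature indices {idx}")

# Predictions still come back with one column per target.
pred = pred = pipe2.predict(X)
```

The indices refer to columns of the polynomial-expanded matrix, not of the original X. Since each target has its own selector, the two index arrays may differ.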

Upvotes: 2
