Reputation: 348
I want to use sklearn.feature_selection.SelectFromModel to select features in a multi-step regression problem. The regression predicts multiple values using MultiOutputRegressor in combination with RandomForestRegressor. When I try to get the selected features with SelectFromModel.get_support(), it raises an error indicating that feature_importances_ must be accessible for the method to work.
It is possible to access the feature_importances_ of the individual estimators inside a MultiOutputRegressor, as indicated in this question. However, I am unsure how to pass these feature_importances_ correctly to the SelectFromModel class.
Here is what I did so far:
# make sample data
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=100, n_features=100, n_targets=5)
print(X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, shuffle=True)
# get important features for prediction problem:
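from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.multioutput import MultiOutputRegressor
import numpy as np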
regr_multirf = MultiOutputRegressor(RandomForestRegressor(n_estimators = 100))
regr_multirf = regr_multirf.fit(X_train, y_train)
sel = SelectFromModel(regr_multirf, max_features= int(np.floor(X_train.shape[1] / 3.)))
sel.fit(X_train, y_train)
sel.get_support()
# feature_importances_ of a single estimator inside the MultiOutputRegressor:
regr_multirf.estimators_[1].feature_importances_
Output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-72-a1d635ad4a34> in <module>()
5 sel = SelectFromModel(regr_multirf, max_features= int(np.floor(X_train.shape[1] / 3.)))
6 sel.fit(X_train, y_train)
----> 7 sel.get_support()
2 frames
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_from_model.py in _get_feature_importances(estimator, norm_order)
30 "`feature_importances_` attribute. Either pass a fitted estimator"
31 " to SelectFromModel or call fit before calling transform."
---> 32 % estimator.__class__.__name__)
33
34 return importances
ValueError: The underlying estimator MultiOutputRegressor has no `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to SelectFromModel or call fit before calling transform.
Any help and hints would be appreciated.
Upvotes: 2
Views: 1405
Reputation: 775
With sklearn's MultiOutputRegressor, each target is fitted with its own model, as stated in the documentation: "This strategy consists of fitting one regressor per target." That means you need to compute the feature importances for each of the random forest regressors in your MultiOutputRegressor.
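As a minimal sketch of the "one regressor per target" behaviour, the snippet below fits a small model and inspects it. The data sizes, `random_state` values and variable names are assumptions chosen for illustration and are not taken from the question.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Small synthetic problem: 10 features, 3 targets (sizes chosen for illustration).
X, y = make_regression(n_samples=60, n_features=10, n_targets=3, random_state=0)

model = MultiOutputRegressor(RandomForestRegressor(n_estimators=10, random_state=0))
model.fit(X, y)

# One fitted RandomForestRegressor per target column of y.
n_fitted = len(model.estimators_)
print(n_fitted)

# The wrapper itself exposes no importances; each inner forest does.
wrapper_has_importances = hasattr(model, "feature_importances_")
per_target_shapes = [est.feature_importances_.shape for est in model.estimators_]
print(wrapper_has_importances, per_target_shapes)
```

The missing attribute on the wrapper is what triggers the ValueError in the question.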
The feature importances for each regressor are not stored directly on the MultiOutputRegressor. Instead, you can extract each regressor (also called an estimator) from the fitted MultiOutputRegressor object via
regr_multirf.estimators_[i]  # i = index of the regressor (target) you want
where regr_multirf is your fitted MultiOutputRegressor.
Therefore, you do not need SelectFromModel to retrieve the feature importances of a multi-output sklearn regression model; instead, work directly with each estimator as explained in this question, on which this answer is also heavily based. Your approach would only work for estimators that natively predict multivariate targets rather than training a separate model for each target.
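If you still want a single feature mask from SelectFromModel, one possible workaround is to aggregate the per-target importances yourself. The sketch below assumes scikit-learn 0.24 or newer, where SelectFromModel accepts a callable `importance_getter`; the helper `mean_importances` is hypothetical, and averaging across targets is just one aggregation choice, not something the library prescribes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.multioutput import MultiOutputRegressor


def mean_importances(multi_model):
    """Hypothetical helper: average feature_importances_ over the per-target forests."""
    return np.mean([est.feature_importances_ for est in multi_model.estimators_], axis=0)


# Illustrative data; sizes and seeds are assumptions, not from the question.
X, y = make_regression(n_samples=80, n_features=30, n_targets=3, random_state=0)
k = X.shape[1] // 3  # keep a third of the features, as in the question

sel = SelectFromModel(
    MultiOutputRegressor(RandomForestRegressor(n_estimators=20, random_state=0)),
    importance_getter=mean_importances,  # called with the fitted MultiOutputRegressor
    max_features=k,
    threshold=-np.inf,  # disable the threshold so selection depends on max_features only
)
sel.fit(X, y)

mask = sel.get_support()
X_reduced = sel.transform(X)
print(mask.sum(), X_reduced.shape)
```

On older versions without `importance_getter`, the same averaged array could be ranked manually, for example with `np.argsort`, to pick the top features.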
In your case, you can retrieve the feature importances directly from your fitted regressor regr_multirf via
# make sample data
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.feature_selection import SelectFromModel
import numpy as np
import pandas as pd
X, y = make_regression(n_samples=100, n_features=100, n_targets=5)
print(X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, shuffle=True)
regr_multirf = MultiOutputRegressor(RandomForestRegressor(n_estimators = 100))
regr_multirf = regr_multirf.fit(X_train, y_train)
# now extract the estimator from your regression model
# this estimator carries the feature importances
# you're interested in
# You can also loop the following code
# over all your targets
no_est = 0 # index of target you want feature importance for
# get estimator
est = regr_multirf.estimators_[no_est]
# get feature importances
feature_importances = pd.DataFrame(est.feature_importances_,
columns=['importance']).sort_values('importance')
print(feature_importances)
feature_importances.plot(kind = 'barh')
Output: (horizontal bar plot of the sorted feature importances; image not preserved)
Upvotes: 2