towi_parallelism

Reputation: 1481

Right way to use RFECV and Permutation Importance - Sklearn

There is a proposal to implement this in Sklearn (#15075), but in the meantime eli5 is suggested as a workaround. However, I'm not sure I'm using it the right way. This is my code:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
import eli5
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
perm = eli5.sklearn.PermutationImportance(estimator, scoring='r2', n_iter=10, random_state=42, cv=3)
selector = RFECV(perm, step=1, min_features_to_select=1, scoring='r2', cv=3)
selector = selector.fit(X, y)
selector.ranking_

There are a few issues:

  1. I am not sure I am using cross-validation the right way. Should PermutationImportance use cv to validate the importances on held-out folds, or should cross-validation happen only in RFECV? (In the example I passed cv=3 to both, but I'm not sure that's the right thing to do.)

  2. If I run eli5.show_weights(perm), I get: AttributeError: 'PermutationImportance' object has no attribute 'feature_importances_'. Is this because I fit using RFECV? What I'm doing is similar to the last snippet here: https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html

  3. As a less important issue, this gives me a warning when I set cv in eli5.sklearn.PermutationImportance:

.../lib/python3.8/site-packages/sklearn/utils/validation.py:68: FutureWarning: Pass classifier=False as keyword args. From version 0.25 passing these as positional arguments will result in an error

The whole process is a bit vague. Is there a way to do this directly in Sklearn, e.g. by adding a feature_importances_ attribute?

Upvotes: 5

Views: 3516

Answers (2)

Marco Cerliani

Reputation: 22031

You can use RFECV directly in sklearn by writing your own estimator that computes feature importances, with whatever logic you want, when fit is called.

If you want to compute permutation-based feature importance with an SVR regressor, the estimator to implement looks like this:

from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

class SVRExplainerRegressor(SVR):
    def fit(self, X, y):
        # hold out a validation set on which to score permutation importance
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.25, random_state=42, shuffle=True
        )
        super().fit(X_train, y_train)

        # store the mean importances so RFECV can read them via importance_getter
        self.perm_feature_importances_ = permutation_importance(
            self, X_val, y_val,
            n_repeats=5, random_state=42,
        )['importances_mean']

        # refit on all the received data before returning
        return super().fit(X, y)

SVRExplainerRegressor does the following:

  • splits the received data into a training and a validation set
  • fits an SVR regressor on the training set
  • computes and stores the feature importances on the validation set, using the permutation technique
  • finally, refits on all the received data
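The stored attribute can be sanity-checked on its own before wiring the estimator into RFECV; a minimal sketch (the class is repeated here so the snippet runs standalone):

```python
from sklearn.datasets import make_friedman1
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

class SVRExplainerRegressor(SVR):
    def fit(self, X, y):
        # hold out a validation set to score permutation importance on
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.25, random_state=42, shuffle=True
        )
        super().fit(X_train, y_train)
        self.perm_feature_importances_ = permutation_importance(
            self, X_val, y_val, n_repeats=5, random_state=42
        )['importances_mean']
        # refit on all the data before returning
        return super().fit(X, y)

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
model = SVRExplainerRegressor(kernel='linear').fit(X, y)
print(model.perm_feature_importances_.shape)  # (10,) - one importance per feature
```

After a single fit call the importances are already available, which is exactly what RFECV needs at each elimination step.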

SVRExplainerRegressor can be used like any sklearn model as RFECV's estimator in this way:

from sklearn.feature_selection import RFECV
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

model = SVRExplainerRegressor(kernel='linear')
selector = RFECV(model, step=1, min_features_to_select=1, 
                 importance_getter='perm_feature_importances_', 
                 scoring='r2', cv=3)
selector.fit(X, y)

This logic can be customized with any estimator (regressor or classifier) and any feature-importance logic (like SHAP or similar).
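For instance, a classifier analogue of the same pattern might look like this (a sketch; SVCExplainerClassifier is a hypothetical name, not part of sklearn):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# hypothetical classifier counterpart of SVRExplainerRegressor
class SVCExplainerClassifier(SVC):
    def fit(self, X, y):
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.25, random_state=42, stratify=y
        )
        super().fit(X_train, y_train)
        # scored with the classifier's default metric (accuracy)
        self.perm_feature_importances_ = permutation_importance(
            self, X_val, y_val, n_repeats=5, random_state=42
        )['importances_mean']
        return super().fit(X, y)

X, y = make_classification(n_samples=60, n_features=8, random_state=0)
clf = SVCExplainerClassifier(kernel='linear').fit(X, y)
print(clf.perm_feature_importances_.shape)  # (8,)
```

The same importance_getter='perm_feature_importances_' argument would then hand these values to RFECV.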

Upvotes: 1

afsharov

Reputation: 5174

Since the objective is to select the optimal number of features with permutation importance and recursive feature elimination, I suggest using RFECV and PermutationImportance in conjunction with a CV splitter like KFold. The code could then look like this:

import warnings
from eli5 import show_weights
from eli5.sklearn import PermutationImportance
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.svm import SVR


warnings.filterwarnings("ignore", category=FutureWarning)

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

splitter = KFold(n_splits=3)  # 3 folds as in the example

estimator = SVR(kernel="linear")
selector = RFECV(
    PermutationImportance(estimator, scoring='r2', n_iter=10, random_state=42, cv=splitter),
    cv=splitter,
    scoring='r2',
    step=1
)
selector = selector.fit(X, y)
selector.ranking_

show_weights(selector.estimator_)

Regarding your issues:

  1. PermutationImportance will calculate the feature importances and RFECV the r2 score with the same strategy, according to the splits provided by KFold.

  2. You called show_weights on the unfitted PermutationImportance object. That is why you got an error. You should access the fitted object with the estimator_ attribute instead.

  3. Can be ignored.
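The estimator_ pattern from point 2 is not specific to eli5; an eli5-free sketch with plain sklearn shows that RFECV exposes the refitted inner estimator the same way after fit (linear-kernel SVR is used here because it exposes coef_, so no importance_getter is needed):

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

selector = RFECV(SVR(kernel='linear'), step=1, cv=3).fit(X, y)

# the refitted inner estimator lives in estimator_,
# trained on the selected features only
print(type(selector.estimator_).__name__)  # SVR
print(selector.estimator_.n_features_in_ == selector.n_features_)  # True
```

Calling show_weights (or any inspection) on this fitted estimator_ works, while the unfitted wrapper passed into RFECV never gains the fitted attributes.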

Upvotes: 3
