

Feature-selection and prediction

from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris

I have X and Y data.

data = load_iris()
X = data.data
Y = data.target

I would like to implement RFECV feature selection and then prediction, both using k-fold cross-validation.

code corrected from the answer @

clf = RandomForestClassifier()

kf = KFold(n_splits=2, shuffle=True, random_state=0)  

estimators = [('standardize' , StandardScaler()),
              ('clf', clf)]

class Mypipeline(Pipeline):
    # Expose the final estimator's attributes as properties so RFECV can read them
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, cv=kf, scoring='accuracy', verbose=10)
rfecv_data = rfecv.fit(X, Y)

print ('no. of selected features =', rfecv_data.n_features_) 

EDIT (for the small remaining part):

X_new = rfecv.transform(X)
print(X_new.shape)

y_predicts = cross_val_predict(clf, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print ('accuracy =', accuracy)

Upvotes: 2

Views: 1149

Answers (2)

Vivek Kumar

Reputation: 36619

Instead of wrapping StandardScaler and RFECV in the same pipeline, put StandardScaler and RandomForestClassifier in a pipeline and pass that pipeline to RFECV as the estimator. This way the scaler is refit inside each cross-validation fold, so no information leaks between the training and validation data.

estimators = [('standardize' , StandardScaler()),
              ('clf', RandomForestClassifier())]

pipeline = Pipeline(estimators)

rfecv = RFECV(estimator=pipeline, scoring='accuracy')
rfecv_data = rfecv.fit(X, Y)

Update: About the error 'RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes'

Yes, that's a known issue with the scikit-learn Pipeline: it does not expose the final estimator's coef_ or feature_importances_ attributes. You can look at my other answer here for more details and use the new pipeline I created there.

Define a custom pipeline like this:

class Mypipeline(Pipeline):
    # Expose the final estimator's attributes as properties so RFECV can read them
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

And use that:

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, scoring='accuracy')
rfecv_data = rfecv.fit(X, Y)
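
As a possible alternative to the subclass, newer scikit-learn releases let RFECV be told where to find the importances. The following is a minimal sketch, assuming scikit-learn 0.24 or later (where RFECV accepts the importance_getter parameter) and using the iris data from the question. The n_estimators and random_state values are illustrative choices, not part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)

# Plain Pipeline, no subclass needed (assumes scikit-learn >= 0.24)
pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=20, random_state=0)),
])
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# importance_getter is an attribute path resolved on the fitted pipeline;
# 'clf' must match the name of the final pipeline step
rfecv = RFECV(
    estimator=pipeline,
    cv=kf,
    scoring='accuracy',
    importance_getter='named_steps.clf.feature_importances_',
)
rfecv_data = rfecv.fit(X, Y)
print('no. of selected features =', rfecv_data.n_features_)
```

On older scikit-learn versions this parameter is not available, and the Mypipeline subclass remains the way to go.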

Update 2:

@brute, for your data and code, the algorithm completes within a minute on my PC. This is the complete code I used:

import numpy as np
import glob
from sklearn.utils import resample
files = glob.glob('/home/Downloads/Untitled Folder/*') 
outs = [] 
for fi in files:
    data = np.genfromtxt(fi, delimiter='|', dtype=float)
    # drop rows containing NaN
    data = data[~np.isnan(data).any(axis=1)]
    # subsample 1800 rows per file without replacement
    data = resample(data, replace=False, n_samples=1800, random_state=0)
    outs.append(data)

X = np.vstack(outs)
print(X.shape)
Y = np.repeat([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1800)
print(Y.shape)

#from sklearn.utils import shuffle
#X, Y = shuffle(X, Y, random_state=0)

from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

clf = RandomForestClassifier()

kf = KFold(n_splits=10, shuffle=True, random_state=0)  

estimators = [('standardize' , StandardScaler()),
              ('clf', RandomForestClassifier())]

class Mypipeline(Pipeline):
    # Expose the final estimator's attributes as properties so RFECV can read them
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, scoring='accuracy', verbose=10)
rfecv_data = rfecv.fit(X, Y)

print ('no. of selected features =', rfecv_data.n_features_) 

Update 3: For cross_val_predict

X_new = rfecv.transform(X)
print(X_new.shape)

# Here change clf to pipeline, 
# because RFECV has found features according to scaled data,
# which is not present when you pass clf 
y_predicts = cross_val_predict(pipeline, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print ('accuracy =', accuracy)
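
Note that the snippet above selects features on all of X before cross-validating, so the selection step has already seen every sample. If that matters for your estimate, one option is to pass the whole RFECV object to cross_val_predict so that selection is repeated inside each outer fold. This is only a sketch under assumptions: it uses the iris data from the question, the importance_getter parameter (scikit-learn 0.24 or later) in place of Mypipeline, and illustrative names inner_cv and outer_cv that are not in the original code.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=20, random_state=0)),
])

# inner_cv drives feature selection, outer_cv produces the predictions
# (both names and split counts are illustrative assumptions)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

selector = RFECV(
    estimator=pipeline,
    cv=inner_cv,
    scoring='accuracy',
    importance_getter='named_steps.clf.feature_importances_',
)

# RFECV is refit on each outer training fold, so scaling and selection
# never see the samples being predicted
y_predicts = cross_val_predict(selector, X, Y, cv=outer_cv)
accuracy = accuracy_score(Y, y_predicts)
print('accuracy =', accuracy)
```

The trade-off is run time: every outer fold reruns the full recursive elimination, so this is noticeably slower on large data.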

Upvotes: 3

Ekaba Bisong

Reputation: 2982

Here's how we'll do it:

Fit on the training set

from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()    
X, Y = data.data, data.target

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, shuffle=True)

# create model
clf = RandomForestClassifier()    
# instantiate K-Fold
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# pipeline estimators
estimators = [('standardize' , StandardScaler()),
             ('rfecv', RFECV(estimator=clf, cv=kf, scoring='accuracy'))]

# instantiate pipeline
pipeline = Pipeline(estimators)    
# fit rfecv to train model
rfecv_model = pipeline.fit(X_train, y_train)

# print number of selected features
print ('no. of selected features =', pipeline.named_steps['rfecv'].n_features_)
# print feature ranking
print ('ranking =', pipeline.named_steps['rfecv'].ranking_)

no. of selected features = 3
ranking = [1 2 1 1]

Predict on the test set

# make predictions on the test set
predictions = rfecv_model.predict(X_test)

# evaluate the model performance using accuracy metric
print("Accuracy on test set: ", accuracy_score(y_test, predictions))

Accuracy:  0.9736842105263158
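
To see which features were kept rather than only how many, the fitted RFECV step also exposes a boolean support_ mask. A small sketch, assuming the same iris setup as above. The random_state, n_estimators and n_splits values are illustrative, so the selected set may differ from the output shown earlier.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, shuffle=True, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0)
pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('rfecv', RFECV(estimator=clf, cv=kf, scoring='accuracy')),
])
pipeline.fit(X_train, y_train)

# support_ is True for every feature RFECV decided to keep
mask = pipeline.named_steps['rfecv'].support_
selected = [name for name, keep in zip(data.feature_names, mask) if keep]
print('selected features =', selected)
```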

Upvotes: -1
