Reputation: 4608
I want to perform recursive feature elimination with cross-validation (RFECV) within 10-fold cross-validation (i.e. cross_val_predict or cross_validate) in sklearn.
Since RFECV itself has cross-validation in its name, I am not clear how to do it. My current code is as follows.
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = RandomForestClassifier(random_state=0, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
rfecv = RFECV(estimator=clf, step=1, cv=k_fold)
Please let me know how I can use the data X and y with rfecv in 10-fold cross-validation.
I am happy to provide more details if needed.
Upvotes: 0
Views: 1510
Reputation: 88236
To perform feature selection with RFE and then fit a random forest with 10-fold cross-validation, here's how you could do it:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import RFE
rf = RandomForestClassifier(random_state=0, class_weight="balanced")
rfe = RFE(estimator=rf, step=1)
Now transform the original X by fitting the RFE:
X_new = rfe.fit_transform(X, y)
Here are the ranked features (not much of a problem with only 4 of them):
rfe.ranking_
# array([2, 3, 1, 1])
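To see which columns those ranks refer to, the fitted selector's support_ mask can be lined up with the feature names. This is a minimal sketch, assuming the iris data and the same RFE settings as above; the names selected and ranks are only illustrative:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

iris = datasets.load_iris()
X, y = iris.data, iris.target

rf = RandomForestClassifier(random_state=0, class_weight="balanced")
# With n_features_to_select left at its default, RFE keeps half of the features.
rfe = RFE(estimator=rf, step=1).fit(X, y)

# support_ is a boolean mask over the columns; rank 1 means "selected".
selected = [name for name, keep in zip(iris.feature_names, rfe.support_) if keep]
ranks = dict(zip(iris.feature_names, rfe.ranking_))
print(selected)
print(ranks)
```

Features with rank 1 are the ones that X_new contains; higher ranks were eliminated earlier.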
Now split into train and test data and perform a cross-validation in conjunction with a grid search using GridSearchCV (they usually go together):
X_train, X_test, y_train, y_test = train_test_split(X_new, y, train_size=0.7)
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
param_grid = {
    'n_estimators': [5, 10, 15, 20],
    'max_depth': [2, 5, 7, 9]
}
# Pass the splitter itself; GridSearchCV calls split() on the training data internally.
grid_clf = GridSearchCV(rf, param_grid, cv=k_fold)
grid_clf.fit(X_train, y_train)
y_pred = grid_clf.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[17, 0, 0],
[ 0, 11, 0],
[ 0, 3, 14]], dtype=int64)
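One caveat with the steps above: the RFE is fitted on all of X before the split, so the held-out rows have already influenced which features were kept. A possible way around this, sketched here under the assumption that a Pipeline is acceptable, is to put the selector inside the pipeline so that it is refit on each training fold. The step names select and model, the fixed random_state, and the reduced grid are illustrative choices, not part of the original answer:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

# Feature selection is a pipeline step, so it only ever sees the training folds.
pipe = Pipeline([
    ("select", RFE(RandomForestClassifier(n_estimators=10, random_state=0))),
    ("model", RandomForestClassifier(random_state=0, class_weight="balanced")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "model__n_estimators": [5, 10],
    "model__max_depth": [2, 5],
}

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
grid_clf = GridSearchCV(pipe, param_grid, cv=k_fold)
grid_clf.fit(X_train, y_train)

print(grid_clf.best_params_)
test_accuracy = grid_clf.score(X_test, y_test)
print(test_accuracy)
```

The test split is taken from the untransformed X here, so the final score reflects both the selection step and the classifier.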
Upvotes: 0
Reputation: 60318
To use recursive feature elimination in conjunction with a pre-defined k_fold, you should use RFE and not RFECV:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(random_state = 0, class_weight="balanced")
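# Iris has only 4 features, so n_features_to_select=5 eliminates nothing; lower it to actually drop features.
selector = RFE(clf, n_features_to_select=5, step=1)
cv_acc = []
for train_index, val_index in k_fold.split(X, y):
    # Refit the selector (and its final estimator) on each training fold
    selector.fit(X[train_index], y[train_index])
    pred = selector.predict(X[val_index])
    acc = accuracy_score(y[val_index], pred)
    cv_acc.append(acc)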
cv_acc
# result:
[1.0,
0.9333333333333333,
0.9333333333333333,
1.0,
0.9333333333333333,
0.9333333333333333,
0.8666666666666667,
1.0,
0.8666666666666667,
0.9333333333333333]
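If RFECV is still wanted, one option is to treat it as an ordinary estimator and pass it to cross_val_score (or cross_validate), which gives a nested setup: the inner folds pick the number of features and the outer 10 folds score the result. This is a sketch rather than a definitive recipe; inner_cv, outer_cv, and the smaller n_estimators=20 (chosen only to keep the run short) are assumptions:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

clf = RandomForestClassifier(n_estimators=20, random_state=0, class_weight="balanced")

# Inner folds: used by RFECV to choose how many features to keep.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Outer folds: used to score the whole "select features, then classify" procedure.
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

rfecv = RFECV(estimator=clf, step=1, cv=inner_cv, scoring="accuracy")

# RFECV is cloned and refit on each outer training fold.
scores = cross_val_score(rfecv, X, y, cv=outer_cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

Swapping cross_val_score for cross_val_predict with the same arguments (minus scoring) should return out-of-fold predictions instead of per-fold scores.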
Upvotes: 1