Reputation: 409
I want to apply a wrapper method like Recursive Feature Elimination to my regression problem with scikit-learn. Recursive feature elimination with cross-validation gives a good overview of how to tune the number of features automatically.
I tried this:
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

modelX = LogisticRegression()
rfecv = RFECV(estimator=modelX, step=1, scoring='mean_absolute_error')
rfecv.fit(df_normdf, y_train)
print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
but I receive an error message like:
The least populated class in y has only 1 members, which is too few.
The minimum number of labels for any class cannot be less than n_folds=3. % (min_labels, self.n_folds)), Warning)
The warning sounds as if I had a classification problem, but my task is a regression problem. What is wrong, and what can I do to get a result?
Upvotes: 2
Views: 3957
Reputation: 5921
Here is what happened: by default, when the number of folds is not specified by the user, the cross-validation inside RFECV uses 3-fold cross-validation. So far so good.
However, if you look at the documentation, it also uses StratifiedKFold, which creates the folds so that the percentage of samples for each class is preserved. Since (according to the error) some values of your output y are unique, they cannot appear in 3 different folds at the same time, so an error is thrown.
That stratification check is where the error comes from.
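To see what stratification complains about, here is a small self-contained sketch with made-up data (the values below are assumptions, chosen so that one class has only a single member):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up data: the value 5 appears only once, just as a continuous target
# tends to contain values that occur a single time
X = np.arange(20).reshape(10, 2)
y = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 5])

skf = StratifiedKFold(n_splits=3)
# Stratification needs at least n_splits samples per class, so this triggers the
# same "least populated class in y has only 1 members" complaint
# (the exact wording, and whether it is a warning or an error, depends on the version)
for train_index, test_index in skf.split(X, y):
    pass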
You then need to use an unstratified K-fold: KFold.
The documentation of RFECV says:
"If the estimator is a classifier or if y is neither binary nor multiclass, sklearn.model_selection.KFold is used."
Upvotes: 1