Reputation: 43
I used mutual_info_classif and SelectPercentile from sklearn to do feature selection on a dataset. I found that I can set random_state to 0 so that the selected features are the same every time, like the code below:
from functools import partial
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

mi = mutual_info_classif(X_train, y_train, random_state=0)
print(mi)
sel_mi = SelectPercentile(partial(mutual_info_classif, random_state=0), percentile=10).fit(X_train, y_train)
Alternatively, I can leave random_state at its default, but then the selection is different every time:
mi = mutual_info_classif(X_train, y_train)
I want to know: if the selected features are the same every time, how can I judge whether they are the best choice of features?
And if the selection is different every time, does that mean this kind of feature selection is meaningless?
Upvotes: 4
Views: 2325
Reputation: 25249
ML is more of an art than a science. Some algorithms always return the same result regardless of seed (e.g. linear regression); others (e.g. decision trees) return different results depending on the subsample; and some (e.g. random forests) may return different results even on the same subsample, depending on the seed.
An algorithm that returns different results on different subsamples means your result depends on the data distribution, which may change with the seed you provide. It doesn't mean the algorithm is useless. You should pay more attention to the features that consistently appear most important regardless of the data subsample, as in the sketch below.
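A minimal sketch of that stability check, assuming X_train and y_train are your existing training data; the 10 seeds and the 8-out-of-10 threshold are arbitrary choices:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

n_top = max(1, int(0.10 * X_train.shape[1]))  # same cutoff as percentile=10
counts = np.zeros(X_train.shape[1], dtype=int)

for seed in range(10):
    mi = mutual_info_classif(X_train, y_train, random_state=seed)
    top = np.argsort(mi)[-n_top:]   # indices of the highest-MI features this run
    counts[top] += 1

# features that land in the top 10% in (nearly) every run are the ones to trust
stable = np.where(counts >= 8)[0]
print("selected in >= 8/10 runs:", stable)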
You may get more consistent results by providing more data, since the subsamples will then be more representative of the overall distribution.
A final remark. Feature importance can be a valuable exercise for exploring your data and deciding what to pay more attention to while collecting or cleaning it. But it is less important for model building, as most algorithms have built-in mechanisms for feature selection.
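For instance, tree ensembles report importances as a by-product of fitting, so no separate MI-based selection step is needed just to build the model. A rough sketch, again assuming X_train / y_train and an arbitrary forest size:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
print(rf.feature_importances_)  # importance the fitted forest assigns to each feature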
Upvotes: 2