Reputation: 159
I am getting two different sets of p-values when I use scikit-learn's SelectKBest with
predKbest = SelectKBest(sklearn.feature_selection.f_regression, k=i).fit(X_train, y_train)
predKbest.pvalues_
and
predKbest = SelectKBest(sklearn.feature_selection.chi2, k=i).fit(X_train, y_train)
predKbest.pvalues_
on the same data X_train and y_train. Are the p-values supposed to be different?
Upvotes: 2
Views: 1300
Reputation: 2062
SelectKBest will select, in your case, the top i variables ranked by the score of the test you pass in: the F-test (f_regression) or the chi-squared test (chi2).
f_regression is meant for regression (a continuous target) while chi2 is meant for classification (a categorical target), so it is odd to use both with the same X_train and y_train. Take a step back and decide which kind of problem you are actually solving. The scikit-learn feature selection guide explains this well: http://scikit-learn.org/stable/modules/feature_selection.html
f_regression tests each regressor separately: it computes an F-statistic for the linear dependency between that regressor and the target, and the p-value is derived from that statistic.
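To make that concrete, here is a minimal sketch that reproduces the f_regression output by hand from the Pearson correlation of each feature with the target. The synthetic data, the seed and names such as `F_manual` are assumptions made up for illustration, and the formula assumes the default `center=True`.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Synthetic data (illustrative only): y depends linearly on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(size=200)

# What scikit-learn reports: one F-statistic and one p-value per feature.
F, p = f_regression(X, y)

# Manual version: Pearson correlation of each feature with y, turned into
# an F-statistic with (1, n - 2) degrees of freedom.
n = X.shape[0]
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
F_manual = r**2 / (1 - r**2) * (n - 2)
p_manual = stats.f.sf(F_manual, 1, n - 2)

print(F, F_manual)
print(p, p_manual)
```

Each feature is tested on its own here, so these are univariate p-values and not coefficients from a joint regression.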
The chi2 test "measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification." X must contain only non-negative features such as booleans or frequencies.
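A small sketch of what that means in practice, assuming count-like features and a binary class label. The data, the seed and the variable names are invented for the example. It also shows that chi2 rejects negative inputs.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Synthetic count data (illustrative only) with a binary class label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X_counts = np.column_stack([
    rng.poisson(lam=2 + 3 * y),      # rate depends on the class
    rng.poisson(lam=3, size=300),    # independent of the class
])

# One chi-squared statistic and one p-value per feature.
chi2_stats, p_chi2 = chi2(X_counts, y)
print(chi2_stats, p_chi2)

# chi2 only accepts non-negative features; shifting the data below zero
# is expected to raise a ValueError.
rejected_negative = False
try:
    chi2(X_counts - 10, y)
except ValueError:
    rejected_negative = True
print(rejected_negative)
```

The class-dependent feature should get a much smaller p-value than the independent one.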
The p-values you print are the chi2 and f_regression test statistics converted into p-values. The two tests compute different statistics under different assumptions, so it is completely normal that the p-values differ.
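If you want to see this on a toy case, the sketch below fits SelectKBest twice on the same data, once per score function, and compares `pvalues_`. The data is synthetic and `k=2` is an arbitrary choice for the example. The target is a 0/1 label, which f_regression treats as a number and chi2 treats as a class.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_regression

# Synthetic non-negative features and a 0/1 target (illustrative only).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.column_stack([
    rng.poisson(lam=1 + 2 * y),      # related to y
    rng.poisson(lam=3, size=200),    # noise
    rng.poisson(lam=3, size=200),    # noise
])

# Same data, two different univariate tests.
sel_f = SelectKBest(f_regression, k=2).fit(X, y)
sel_chi2 = SelectKBest(chi2, k=2).fit(X, y)

print("f_regression p-values:", sel_f.pvalues_)
print("chi2 p-values:        ", sel_chi2.pvalues_)
print("selected by F-test:", sel_f.get_support())
print("selected by chi2:  ", sel_chi2.get_support())
```

Both calls run without error on this input, but they answer different statistical questions, so compare p-values only within one test and never across the two.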
Upvotes: 2