Scikit Learn feature_selection giving different p-values

I get two different sets of p-values when I use scikit-learn's

predKbest = SelectKBest(sklearn.feature_selection.f_regression, k=i).fit(X_train, y_train)
predKbest.pvalues_

and

predKbest = SelectKBest(sklearn.feature_selection.chi2, k=i).fit(X_train, y_train)
predKbest.pvalues_

on the same data, X_train and y_train. Are the p-values supposed to be different?

Upvotes: 2

Answers (1)

Mohamed AL ANI

Reputation: 2062

SelectKBest keeps, in your case, the top i features ranked by the score of the test you pass in: the F-test (f_regression) or the chi-squared test (chi2).

f_regression is meant for regression targets, while chi2 is meant for classification targets, so it is quite strange to use both of them with the same input variables. You should take a step back and work out what you are really trying to do. Everything is well explained here: http://scikit-learn.org/stable/modules/feature_selection.html

f_regression measures the linear dependency between each regressor and the target: it turns their correlation into an F-statistic and then into a p-value.
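To make that concrete, here is a minimal sketch that recomputes the f_regression output by hand from the Pearson correlation. The synthetic data and the names X, y, f_manual and p_manual are made up for illustration, and the formula assumes the default center=True behaviour.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Synthetic data (illustrative only): y depends linearly on the first column.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + rng.normal(size=n)

# What scikit-learn reports: one F-statistic and one p-value per feature.
f_stat, p_val = f_regression(X, y)

# Manual version: Pearson correlation -> F-statistic -> p-value.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
dof = n - 2
f_manual = r**2 / (1 - r**2) * dof
p_manual = stats.f.sf(f_manual, 1, dof)  # upper tail of F(1, n - 2)

print(np.allclose(f_stat, f_manual), np.allclose(p_val, p_manual))
```

Each feature is tested on its own here, so the p-values say nothing about how features behave jointly.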

The chi2 test "measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification." X must contain only non-negative features such as booleans or frequencies.
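As a rough sketch of both points, the snippet below runs chi2 on made-up count features with class labels, then shows that negative inputs are rejected. The data and the names X_counts, X_negative and raised are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up classification data: binary class labels and count features.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X_counts = np.column_stack([
    rng.poisson(lam=1 + 4 * y),    # counts that depend on the class
    rng.poisson(lam=3, size=300),  # counts unrelated to the class
])

chi2_stat, p_val = chi2(X_counts, y)
print(p_val)  # the class-dependent feature should get the smaller p-value

# Shifting the features below zero makes chi2 refuse the input.
X_negative = X_counts - 5.0
try:
    chi2(X_negative, y)
    raised = False
except ValueError:
    raised = True
print(raised)
```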

The p-values you print are the chi2 and f_regression test statistics "transformed" into p-values. Since the two tests compute different statistics, it is completely normal for the p-values to differ.
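A small sketch of the situation in the question, assuming non-negative features and a binary target so that both score functions accept the same input. The data is synthetic, and X_train and y_train here are stand-ins, not the asker's data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_regression

# Stand-in data: four non-negative integer-valued features, binary target
# driven mostly by the first feature.
rng = np.random.default_rng(42)
X_train = rng.integers(0, 10, size=(500, 4)).astype(float)
y_train = (X_train[:, 0] + rng.integers(0, 3, size=500) > 6).astype(int)

# Same data, two different score functions.
p_f = SelectKBest(f_regression, k=2).fit(X_train, y_train).pvalues_
p_chi2 = SelectKBest(chi2, k=2).fit(X_train, y_train).pvalues_

print(p_f)
print(p_chi2)
```

The two arrays should both single out the first feature while disagreeing on the actual numbers, because each one comes from a different test statistic and reference distribution.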

Upvotes: 2
