ubuntu_noob

Reputation: 2365

Which p-value does sklearn's chi2 use?

I am trying to understand the implementation of sklearn's chi2 for feature selection. I think I understand the chi2 formula.

After computing this value, we look it up in the chi-squared table for 1 degree of freedom and choose a p-value according to our needs. If the chi2 value is greater than the corresponding critical value, we keep the feature; otherwise we ignore it.

My question is: how does the sklearn package choose this p-value on its own? It only requires the X and y arrays as input.

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html

Also, are the chi2 scores always normalized? Link to the paper: http://courses.ischool.berkeley.edu/i256/f06/papers/yang97comparative.pdf

Upvotes: 0

Views: 1633

Answers (1)

Jan K

Reputation: 4150

The idea is to perform univariate feature selection:

  1. For each feature, you compute some kind of statistic (in your case, a chi-squared statistic).
  2. You build a set of the (hopefully) most important features by combining step 1 with a selection method (SelectKBest, SelectPercentile).
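The two steps above can be sketched roughly as follows. The toy arrays `X` and `y` are made up for illustration; `chi2` returns one score and one p-value per feature, and `SelectKBest` then keeps the `k` highest-scoring features. Note that no p-value threshold is passed anywhere.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data (an assumption for illustration): 6 samples, 3 non-negative count features.
X = np.array([
    [5, 1, 0],
    [4, 0, 1],
    [6, 1, 0],
    [0, 1, 5],
    [1, 0, 4],
    [0, 1, 6],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Step 1: chi2 returns two arrays, one statistic and one p-value per feature.
scores, pvalues = chi2(X, y)

# Step 2: keep the k features with the highest chi2 scores.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

# Column indices of the features that were kept.
kept = selector.get_support(indices=True)
print(scores, pvalues, kept)
```

Here the middle feature is distributed identically across both classes, so it gets the lowest score and is the one that is dropped.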

So going back to your question, I think you misunderstood the following point:

  • You always run the chi2 test for all features and then keep only those with the highest chi2 statistic (= lowest p-value). You do not specify a cutoff threshold, since your goal is to keep the most informative features. If you insist on using a cutoff threshold, you would need a selector that thresholds the p-values (scikit-learn's SelectFpr takes an alpha for this, or you can write your own Transformer), and it is not at all obvious what this cutoff value should be (even when applied to p-values).
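To illustrate the "highest chi2 statistic = lowest p-value" point, here is a small sketch on made-up data. It assumes that the p-values are the upper-tail probabilities of a chi-squared distribution with (number of classes - 1) degrees of freedom, which is 1 degree of freedom for a binary target. Because every feature shares the same degrees of freedom, ranking by score and ranking by p-value give the same order.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.feature_selection import chi2

# Toy data (an assumption for illustration): 6 samples, 3 non-negative features.
X = np.array([
    [3, 1, 2],
    [4, 1, 3],
    [5, 2, 2],
    [0, 1, 1],
    [1, 2, 2],
    [0, 1, 1],
])
y = np.array([0, 0, 0, 1, 1, 1])

scores, pvalues = chi2(X, y)

# Recompute the p-values by hand from the scores:
# survival function of chi-squared with (n_classes - 1) degrees of freedom.
n_classes = len(np.unique(y))
pvalues_by_hand = chi2_dist.sf(scores, df=n_classes - 1)

# Rank features from most to least informative in two ways.
by_score = np.argsort(-scores)   # highest statistic first
by_pvalue = np.argsort(pvalues)  # lowest p-value first

print(np.allclose(pvalues, pvalues_by_hand))
print(by_score, by_pvalue)
```

So the p-value is not chosen by the package at all; it is just a monotone transform of each feature's score, and the selection method decides how many features survive.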

Upvotes: 1
