Scipy and Sklearn chi2 implementations give different results

I am using sklearn.feature_selection.chi2 for feature selection and found some unexpected results (check the code). Does anyone know what the reason is, or can point me to some documentation or pull request?

I include a comparison of the results I got and the expected ones obtained by hand and using scipy.stats.chi2_contingency.

The code:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2, SelectKBest

x = np.array([[1, 1, 1, 0, 1], [1, 0, 1, 0, 0], [0, 0, 1, 1, 1], [0, 0, 1, 1, 0], [0, 0, 0, 1, 1], [0, 0, 0, 1, 0]])
y = np.array([1, 1, 2, 2, 3, 3])

scores = []
for i in range(x.shape[1]):
    result = chi2_contingency(pd.crosstab(x[:, i], y))
    scores.append(result[0])

sel = SelectKBest(score_func=chi2, k=3)
sel.fit(x, y)

print(scores)
print(sel.scores_)
print(sel.get_support())

The results are:

[6., 2.4, 6.0, 6.0, 0.0] (Expected)
[4. 2. 2. 2. 0.] (Unexpected)
[ True  True False  True False]

Using scipy, it keeps features 0, 2, 3, while with sklearn it keeps features 0, 1, 3.

Upvotes: 3

Views: 1677

Answers (2)

Data Man

Reputation: 51

Yes, they do give different results. And I think you should trust the results from scipy, and reject the results from sklearn.

But let me lay out my reasoning, because I could be wrong.

I recently observed an effect similar to what you describe, with a data set of 300 data points: the results of the two chi2 implementations do indeed differ. In my case the difference was striking. I described the issue in detail in this article, followed by this Cross Validated discussion thread, and I also submitted a bug report to sklearn, available for review here.

The added value of my research, if any, is the finding that the results delivered by the scipy implementation appear correct, while the results from sklearn appear incorrect. Please see the article for the details. But I only examined my own sample, so the conclusion may not be universally true. Sadly, analyzing the source code is beyond my capability, but I hope this input can help someone either improve the code or disprove my reasoning if it is wrong.

Upvotes: 3

Gambit1614

Reputation: 8801

First, you have the rows and columns of the contingency table interchanged when calculating with the scipy implementation; it should be

scores = []
for i in range(x.shape[1]):
    result = chi2_contingency(pd.crosstab(y, x[:, i]))
    scores.append(result[0])

So now the scipy results are:

[6.000000000000001, 2.4000000000000004, 6.000000000000001, 6.000000000000001, 0.0]

While the ones from sklearn's chi2 are

[4. 2. 2. 2. 0.]

Now I went into the source code, and the two libraries calculate the chi-square values a little differently.

The sklearn implementation: you can check line 171, where the chi2 function is defined; this is the computation sklearn performs before the values are passed to the _chisquare function.

The scipy implementation: you can view the scipy implementation here, which calls this function to finally calculate the chi-square values.

As you can see from the implementations, the difference in values comes from the transformations they perform on the observed and expected values before calculating the chi-square statistic.
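To make the difference concrete, the sklearn numbers can be reproduced directly: as I read the sklearn source, chi2 one-hot encodes y and sums the raw feature values per class as the "observed" counts, instead of building a full contingency table of value frequencies the way pd.crosstab + chi2_contingency does. A minimal sketch in plain NumPy (my reading of the source, not the actual sklearn code):

```python
import numpy as np

x = np.array([[1, 1, 1, 0, 1], [1, 0, 1, 0, 0], [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0], [0, 0, 0, 1, 1], [0, 0, 0, 1, 0]])
y = np.array([1, 1, 2, 2, 3, 3])

# One-hot encode the labels (sklearn does this with LabelBinarizer)
Y = (y[:, None] == np.unique(y)).astype(float)   # shape (6, 3)

# sklearn's "observed" counts: the raw feature values summed per class.
# Each feature column is treated as a count, NOT as a categorical variable.
observed = Y.T @ x                               # shape (3, 5): class x feature

# Expected counts under independence: class probability times feature total
feature_count = x.sum(axis=0)
class_prob = Y.mean(axis=0)
expected = np.outer(class_prob, feature_count)

chi2_sklearn = ((observed - expected) ** 2 / expected).sum(axis=0)
print(chi2_sklearn)                              # [4. 2. 2. 2. 0.]
```

This reproduces sel.scores_ exactly, which supports the reading that sklearn tests class frequency against summed feature values rather than against a crosstab of value occurrences.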


Upvotes: 2
