Reputation: 879
I am trying to understand how the chi2 function computes its result for the following input.
import sklearn.feature_selection
sklearn.feature_selection.chi2([[1, 2, 0, 0, 1],
                                [0, 0, 1, 0, 0],
                                [0, 0, 0, 2, 1]], [True, False, False])
I get the following result for chi2: [2, 4, 0.5, 1, 0.25].
I already found the following formula for its computation on Wikipedia (x_i being referred to as observed and m_i as expected), but I do not know how to apply it:
chi^2 = sum_i (x_i - m_i)^2 / m_i
What I understand is that I have three input rows (samples) and five features (columns), and the chi2 function returns, for each feature, a measure of whether there is a correlation between the feature and the class. For example, the feature represented by the second column occurs twice in the first class and gets a chi2 value of 4.
What I think I have figured out is that the two rows belonging to the class False seem to be somehow combined, but I have not yet figured out how. If anybody can help me out, that would be highly appreciated. Thanks!
Upvotes: 1
Views: 496
Reputation: 131
It seems like there is a misunderstanding between two different kinds of chi2 calculation.
Tom asked why chi2 returns [2, 4, 0.5, 1, 0.25], whereas the scipy_cont and scipy_stats calculations both return [3, 3, 0.75, 0.75, 0.75].
The answer is quite simple: the two figures come from two different chi2 computations. For more information, see the example below.
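As a minimal sketch of the difference on the data from the question (assuming the scipy-style figures come from a standard contingency-table test applied to each column, as in the manual calculation in the other answer):
import numpy as np
from sklearn.feature_selection import chi2
from scipy.stats import chi2_contingency

X = [[1, 2, 0, 0, 1],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 2, 1]]
y = np.array([True, False, False])

# sklearn's chi2 treats the feature values as counts that are summed per class
print(chi2(X, y)[0])  # [2.   4.   0.5  1.   0.25]

# a classic test of independence instead treats each column as a categorical variable
col = np.array(X)[:, 0]
table = [[np.sum((col == v) & (y == c)) for c in (True, False)]
         for v in np.unique(col)]
stat, p, dof, expected = chi2_contingency(table, correction=False)
print(stat)  # 3.0 -- the first entry of [3, 3, 0.75, 0.75, 0.75]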
Upvotes: 1
Reputation: 727
The calculation in sklearn.feature_selection.chi2 differs from the typical textbook example of a chi-square test of independence (for such a classic chi-square test, see the manual calculation I provide below).
sklearn.feature_selection.chi2 (source code): suppose we have a target variable y (categorical, say 0, 1, 2) and a non-negative continuous variable x (say, anywhere between 0 and 100), and we want to test the independence between x and y (e.g., y being independent of x means x is not useful as a predictive feature). The algorithm calculates the group sums of x given y (say, sum_x_y0, sum_x_y1, sum_x_y2 -- call them observed) and compares these observed values to the probability-weighted grand total of x (say, prob_y0*x_tot, prob_y1*x_tot, prob_y2*x_tot -- call them expected) using a chi-square test with (k-1) degrees of freedom for k categories in y. Because it uses the chi-square test, it cannot have negative sums in its calculations, I imagine. (I am not sure if there is an academic reference for this, but the approach seems to make sense.)
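To make that description concrete, here is a minimal sketch of the group-sum calculation described above (my own illustration with a made-up function name, not sklearn's actual source code):
import numpy as np
from scipy import stats

def chi2_feature_scores(X, y):
    # sketch of the group-sum chi-square described above (not sklearn's exact code)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # observed: sum of each feature within each class of y
    observed = np.array([X[y == c].sum(axis=0) for c in classes])
    # expected: class probability times the per-feature grand total
    class_prob = np.array([(y == c).mean() for c in classes])
    expected = np.outer(class_prob, X.sum(axis=0))
    stat = ((observed - expected) ** 2 / expected).sum(axis=0)
    p_values = stats.chi2.sf(stat, df=len(classes) - 1)  # (k-1) degrees of freedom
    return stat, p_values

# on the data from the question this reproduces the scores [2, 4, 0.5, 1, 0.25]
print(chi2_feature_scores([[1, 2, 0, 0, 1],
                           [0, 0, 1, 0, 0],
                           [0, 0, 0, 2, 1]],
                          [True, False, False])[0])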
Here is sample code from the sklearn user guide for selecting features using chi2:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X, y = load_iris(return_X_y=True)
print(X.shape)      # (150, 4)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)  # (150, 2)
For a classic chi-square test of independence between two categorical variables, here is my manual calculation code, and it seems to match the scipy chi-square calculations. The formula I used is the same as the one you posted above, but the degrees of freedom are (levels of x - 1) * (levels of y - 1).
from sklearn.feature_selection import chi2

x = [[1, 2, 0, 0, 1],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 2, 1]]
y = [True, False, False]
print(chi2(x, y)[0])  # [2.   4.   0.5  1.   0.25]
import numpy as np

# boolean mask: which entries of vec equal val
def is_val_eq(vec, val): return [i == val for i in vec]

# expected cell count under independence: (row total * column total) / N
def chi_E(vec1, vec1_val, vec2, vec2_val):
    num1 = sum(is_val_eq(vec1, vec1_val))
    num2 = sum(is_val_eq(vec2, vec2_val))
    return num1 * num2 / len(vec1)

# observed cell count for the combination (vec1_val, vec2_val)
def chi_O(vec1, vec1_val, vec2, vec2_val):
    idx1 = is_val_eq(vec1, vec1_val)
    idx2 = is_val_eq(vec2, vec2_val)
    return sum(np.logical_and(idx1, idx2))

# one term of the chi-square sum, and the full sum over all cells
def chi_inside(O, E): return (O - E)**2 / E
def chi_square(Os, Es): return sum([chi_inside(O, E) for O, E in zip(Os, Es)])

def get_col(x, col): return [row[col] for row in x]

# chi-square statistic for one categorical column vec_x against the target vec_y
def calc_chi(vec_x, vec_y):
    val_xs = set(vec_x)
    val_ys = set(vec_y)
    Es = [chi_E(vec_x, val_x, vec_y, val_y)
          for val_x in val_xs for val_y in val_ys]
    Os = [chi_O(vec_x, val_x, vec_y, val_y)
          for val_x in val_xs for val_y in val_ys]
    return chi_square(Os, Es), Es, Os
from scipy.stats import chi2_contingency
from scipy import stats

chi_calc = dict(manual=[], scipy_cont=[], scipy_stats=[])
for idx_feature in range(5):
    # manual chi-square for this column against y
    chi_sq, Es, Os = calc_chi(get_col(x, idx_feature), y)
    chi_calc['manual'].append(chi_sq)
    # same test via scipy's contingency-table helper (2x2 table of observed counts)
    data = [Os[0:2], Os[2:4]]
    stat, p, dof, expected = chi2_contingency(data, correction=False)
    chi_calc['scipy_cont'].append(stat)
    # and via scipy.stats.chisquare with the matching degrees of freedom
    result = stats.chisquare(data, f_exp=expected, ddof=1, axis=None)
    chi_calc['scipy_stats'].append(result.statistic)
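For reference, printing the collected statistics should show all three calculations in agreement, at roughly the [3, 3, 0.75, 0.75, 0.75] quoted in the other answer:
print(chi_calc['manual'])       # ~[3.0, 3.0, 0.75, 0.75, 0.75]
print(chi_calc['scipy_cont'])   # same values
print(chi_calc['scipy_stats'])  # same values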
Intuitively, if you are trying to test the independence of the categorical-variable columns of x with respect to y, the first two columns of x should give the same statistic (since they are just scaled versions of one another and hence identical in terms of categorical levels).
Upvotes: 1
Reputation: 879
I just looked into the sources of scikit-learn, and the calculation is actually fairly straightforward. In my example, we have two classes (True and False). For the second class, we have two samples ([0, 0, 1, 0, 0] and [0, 0, 0, 2, 1]).
We first sum up the columns for each class, which gives the observed values:
True: [1, 2, 0, 0, 1]
False: [0, 0, 1, 2, 1]
To compute the expected values, we compute the sum of each column over all samples (i.e., the total count with which each feature was observed across all classes), which gives [1, 2, 1, 2, 2]. If we assume there is no correlation between a feature and the class it was found in, the distribution of these totals across the classes must correspond to the number of samples in each class. I.e., 1/3 of each total should be found in the True class and 2/3 in the False class, which gives the expected values:
True: 1/3 * [1, 2, 1, 2, 2] = [1/3, 2/3, 1/3, 2/3, 2/3]
False: 2/3 * [1, 2, 1, 2, 2] = [2/3, 4/3, 2/3, 4/3, 4/3]
Now chi2 can be computed for each column; as an example, for the most interesting last column:
(1-2/3)^2 / (2/3) + (1-4/3)^2 / (4/3) = 1/6 + 1/12 = 1/4 = 0.25
The chi2 value of 0.25 is relatively small; therefore, as one would expect, this feature is independent of the class.
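As a quick sanity check, a short NumPy version of exactly this procedure reproduces the scores from the question:
import numpy as np

X = np.array([[1, 2, 0, 0, 1],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 2, 1]], dtype=float)
y = np.array([True, False, False])

observed = np.array([X[y].sum(axis=0), X[~y].sum(axis=0)])  # [[1, 2, 0, 0, 1], [0, 0, 1, 2, 1]]
totals = X.sum(axis=0)                                      # [1, 2, 1, 2, 2]
expected = np.outer([y.mean(), (~y).mean()], totals)        # 1/3 and 2/3 of the totals
print(((observed - expected)**2 / expected).sum(axis=0))    # [2.   4.   0.5  1.   0.25]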
Upvotes: 1