Tom

Reputation: 879

Computation of Chi-Square Test

I am trying to understand how the chi2 function is computed for the following input.

sklearn.feature_selection.chi2([[1, 2, 0, 0, 1],
                                [0, 0, 1, 0, 0],
                                [0, 0, 0, 2, 1]], [True, False, False])

I get the following result [2, 4, 0.5, 1, 0.25] for chi2.

I already found the following formula for its computation on Wikipedia (x_i being the observed values and m_i the expected values), but I do not know how to apply it.

chi^2 = sum_i (x_i - m_i)^2 / m_i

What I understand is that I have three samples of input (rows) and five features (columns), and the chi2 function returns whether there is a correlation between each feature and the class. The feature represented by the second column, for example, occurs twice in the first sample and gets a chi2 value of 4.

What I think I have figured out is that

  1. the columns are treated independently of each other, which makes sense
  2. if I omit the third row, the expected values would be the sums of the columns and the observed values simply the values in the respective cells, except that this does not work for the last column
  3. the two rows labeled False seem to be somehow combined, but I have not yet figured out how.

If anybody can help me out that would be highly appreciated. Thanks!

Upvotes: 1

Views: 496

Answers (3)

Asad Mehasi

Reputation: 131

It seems like there is a misunderstanding between two different types of chi2 tests.

Tom asked why chi2 returns [2, 4, 0.5, 1, 0.25], whereas the scipy_cont and scipy_stats calculations (see the other answer) both return [3, 3, 0.75, 0.75, 0.75].

The answer is quite simple:

  • We use chi2 if we know the distribution of the target variable in advance.
  • If we do not, we use scipy_cont.

For more information, read the following: Die example
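
As a quick illustration with the data from the question, here is a small sketch (the loop and its variable names are mine) that reproduces both sets of numbers, using sklearn's feature-selection chi2 on one hand and the classic contingency-table test from scipy on the other:

import numpy as np
from sklearn.feature_selection import chi2
from scipy.stats import chi2_contingency

X = np.array([[1, 2, 0, 0, 1],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 2, 1]])
y = np.array([True, False, False])

# sklearn's feature-selection chi2
print(chi2(X, y)[0])                        # approx. [2.  4.  0.5 1.  0.25]

# classic chi2 test of independence: one 2x2 contingency table per column
for col in X.T:
    table = [[np.sum((col == v) & (y == c)) for v in np.unique(col)]
             for c in (True, False)]
    print(chi2_contingency(table, correction=False)[0])   # approx. 3, 3, 0.75, 0.75, 0.75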

Upvotes: 1

KM_83

Reputation: 727

The calculation in sklearn.feature_selection.chi2 differs from the typical textbook example you find for a chi-square test of independence (for such a classic chi-square test, see the manual calculation I provide below).


sklearn.feature_selection.chi2 (source code): suppose we have a target variable y (categorical, say 0, 1, 2) and a non-negative continuous variable x (say, anywhere between 0 and 100), and we want to test the independence between x and y (e.g., y being independent of x means x is not useful as a predictive feature). The algorithm calculates the group sums of x given y (say, sum_x_y0, sum_x_y1, sum_x_y2 -- call them observed) and compares these observed values to the probability-weighted grand total of x (say, prob_y0*x_tot, prob_y1*x_tot, prob_y2*x_tot -- call them expected) using a chi-square test with (k-1) degrees of freedom for k categories in y. Because it uses the chi-square test, it cannot, as far as I can tell, have negative sums in its calculations. (I am not sure if there is an academic reference for this, but the approach seems to make sense.)
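
To make the group-sum idea concrete, here is a minimal numpy sketch of that calculation applied to the data from the question (my own variable names, not the library's actual source):

import numpy as np

X = np.array([[1, 2, 0, 0, 1],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 2, 1]], dtype=float)
y = np.array([True, False, False])

# observed: per-class column sums of X (the group sums sum_x_y0, sum_x_y1, ...)
observed = np.array([X[y == c].sum(axis=0) for c in (True, False)])

# expected: class probabilities times the grand column totals (prob_y * x_tot)
class_prob = np.array([(y == c).mean() for c in (True, False)])
expected = np.outer(class_prob, X.sum(axis=0))

chi2_stat = ((observed - expected) ** 2 / expected).sum(axis=0)
print(chi2_stat)   # approx. [2.  4.  0.5 1.  0.25], matching sklearn's chi2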

This is a sample code from the sklearn user guide for selecting features using chi2:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X, y = load_iris(return_X_y=True)
print(X.shape)

X_new = SelectKBest(chi2, k=2).fit_transform(X, y)  # keep the 2 best features by chi2
print(X_new.shape)

For a classic chi-square test of independence between two categorical variables, here is my manual calculation code, and it seems to match the scipy chi-square calculations. The formula I used is the same as you posted above, but the dof is (levels of x - 1) * (levels of y - 1).

from sklearn.feature_selection import chi2
x = [[1, 2, 0, 0, 1],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 2, 1]]
y = [True, False, False]

print(chi2(x, y)[0])  # sklearn's statistics: approx. [2.  4.  0.5 1.  0.25]
import numpy as np

# boolean mask: which entries of vec are equal to val
def is_val_eq(vec, val): return [i == val for i in vec]

# expected cell count under independence: (count of vec1_val) * (count of vec2_val) / n
def chi_E(vec1, vec1_val, vec2, vec2_val):
    num1 = sum(is_val_eq(vec1, vec1_val))
    num2 = sum(is_val_eq(vec2, vec2_val))
    return num1*num2/len(vec1)

# observed cell count: how often vec1_val and vec2_val occur together
def chi_O(vec1, vec1_val, vec2, vec2_val):
    idx1 = is_val_eq(vec1, vec1_val)
    idx2 = is_val_eq(vec2, vec2_val)
    return sum(np.logical_and(idx1, idx2))

def chi_inside(O, E): return (O-E)**2/E

def chi_square(Os, Es): return sum([chi_inside(O, E) for O, E in zip(Os, Es)])

def get_col(x, col): return [row[col] for row in x]

# chi-square statistic for one feature column vec_x against the target vec_y,
# summed over all cells of the (levels of x) x (levels of y) contingency table
def calc_chi(vec_x, vec_y):
    val_xs = set(vec_x)
    val_ys = set(vec_y)
    Es = [chi_E(vec_x, val_x, vec_y, val_y)
          for val_x in val_xs for val_y in val_ys]
    Os = [chi_O(vec_x, val_x, vec_y, val_y)
          for val_x in val_xs for val_y in val_ys]
    return chi_square(Os, Es), Es, Os
from scipy.stats import chi2_contingency
from scipy import stats

chi_calc = dict(manual=[], scipy_cont=[], scipy_stats=[])

for idx_feature in range(5):
    # manual chi-square statistic for this feature column
    chi_sq, Es, Os = calc_chi(get_col(x, idx_feature), y)
    chi_calc['manual'].append(chi_sq)

    # 2x2 table of observed counts (each column here has exactly two distinct values)
    data = [Os[0:2], Os[2:4]]
    stat, p, dof, expected = chi2_contingency(data, correction=False)
    chi_calc['scipy_cont'].append(stat)

    # same statistic via stats.chisquare on the flattened table
    # (only .statistic is used here; ddof affects just the p-value)
    result = stats.chisquare(data, f_exp=expected, ddof=1, axis=None)
    chi_calc['scipy_stats'].append(result.statistic)

Intuitively, if you are trying to test the independence of the categorical-variable columns of x with respect to y, the first two columns of x should give the same statistic (since they are just scaled versions of one another and hence identical in terms of categorical levels).

Upvotes: 1

Tom

Reputation: 879

I just looked into the scikit-learn sources, and the calculation is actually fairly straightforward. In my example, we have two classes (True and False). For the second class, we have two samples ([0, 0, 1, 0, 0] and [0, 0, 0, 2, 1]).

We first sum up the columns for each class, which gives the observed values:

 True: [1, 2, 0, 0, 1]
False: [0, 0, 1, 2, 1]

To compute the expected values, we compute the sum of all columns (i.e., the total count of each feature over all classes), which gives [1, 2, 1, 2, 2]. If we assume there is no correlation between a feature and the class, the distribution of these totals across the classes must correspond to the number of samples in each class. I.e., 1/3 of each total should fall into the True class and 2/3 into the False class, which gives the expected values:

 True: 1/3 * [1, 2, 1, 2, 2] = [1/3, 2/3, 1/3, 2/3, 2/3]
False: 2/3 * [1, 2, 1, 2, 2] = [2/3, 4/3, 2/3, 4/3, 4/3]

Now chi2 can be computed for each column; as an example, here it is for the most interesting last column:

(1-2/3)^2 / (2/3) + (1-4/3)^2 / (4/3) = 1/6 + 1/12 = 1/4 = 0.25

The deviation of 0.25 is relatively small; therefore, as one would expect, this feature is independent of the class.
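
For completeness, the same arithmetic for all five columns can be checked with a few lines of plain Python (variable names are mine):

observed_true  = [1, 2, 0, 0, 1]
observed_false = [0, 0, 1, 2, 1]
totals = [t + f for t, f in zip(observed_true, observed_false)]   # [1, 2, 1, 2, 2]
expected_true  = [1/3 * s for s in totals]
expected_false = [2/3 * s for s in totals]
chi2_per_column = [(ot - et) ** 2 / et + (of - ef) ** 2 / ef
                   for ot, et, of, ef in zip(observed_true, expected_true,
                                             observed_false, expected_false)]
print(chi2_per_column)   # approx. [2.0, 4.0, 0.5, 1.0, 0.25]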

Upvotes: 1
