Computing a chi-squared statistic from scratch using numpy/pandas and matrix computations

Question

I was just looking at https://en.wikipedia.org/wiki/Chi-squared_test and wanted to recreate the example "Example chi-squared test for categorical data".

I suspect the approach I've taken has room for improvement, and I'd like to know how it could be simplified.

Here's the code:

import io
import itertools

import numpy as np
import pandas as pd

csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
observed_workers = pd.read_csv(io.StringIO(csv), index_col=0)

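# Column and row totals of the observed contingency table.
col_sums = observed_workers.sum(axis=0)
row_sums = observed_workers.sum(axis=1)

# Expected count per cell: row total * column total / grand total (row-major order).
l = list(x[1] * (x[0] / col_sums.sum()) for x in itertools.product(row_sums, col_sums))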

expected_workers = pd.DataFrame(
    np.array(l).reshape((3, 4)),
    columns=observed_workers.columns,
    index=observed_workers.index,
)

chi_squared_stat = (
    ((observed_workers - expected_workers) ** 2).div(expected_workers).sum().sum()
)

This returns the correct value, but I suspect I'm overlooking a cleaner approach using built-in numpy / pandas methods.
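
One possible simplification, offered as a sketch rather than a definitive answer: the expected counts are the outer product of the row totals and column totals divided by the grand total, so `np.outer` can stand in for the `itertools.product` loop and the reshape. The names `observed`, `expected` and `total` are illustrative, and the table is re-entered as a plain array only so that the snippet runs on its own.

```python
import numpy as np

# Observed counts from the Wikipedia example
# (rows: white / blue / no collar, columns: neighbourhoods A-D).
observed = np.array(
    [
        [90, 60, 104, 95],
        [30, 50, 51, 20],
        [30, 40, 45, 35],
    ],
    dtype=float,
)

row_sums = observed.sum(axis=1)  # one total per row
col_sums = observed.sum(axis=0)  # one total per column
total = observed.sum()           # grand total

# Expected count for cell (i, j) is row_sums[i] * col_sums[j] / total.
expected = np.outer(row_sums, col_sums) / total

# Pearson chi-squared statistic: sum over cells of (O - E)^2 / E.
chi_squared_stat = ((observed - expected) ** 2 / expected).sum()
print(round(chi_squared_stat, 2))  # approximately 24.57
```

With the DataFrame from the question, the same computation should work on `observed_workers.to_numpy()`, and `expected` can be wrapped back into a `pd.DataFrame` with the original index and columns if the labels are needed.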
