baxx
baxx

Reputation: 4725

Computing a chi square statistic from scratch using numpy/pandas, matrix computations

I was just looking at https://en.wikipedia.org/wiki/Chi-squared_test and wanted to recreate the example "Example chi-squared test for categorical data".

I feel that the approach I've taken might have room for improvement, so was wondering how that might be done.

Here's the code:

csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
observed_workers = pd.read_csv(io.StringIO(csv), index_col=0)

col_sums = dt.apply(sum)
row_sums = dt.apply(sum, axis=1)

l = list(x[1] * (x[0] / col_sums.sum()) for x in itertools.product(row_sums, col_sums))

expected_workers = pd.DataFrame(
    np.array(l).reshape((3, 4)),
    columns=observed_workers.columns,
    index=observed_workers.index,
)

chi_squared_stat = (
    ((observed_workers - expected_workers) ** 2).div(expected_workers).sum().sum()
)

This returns the correct value, but is probably ignorant of a nicer approach using some particular numpy / pandas methods.

Upvotes: 0

Views: 526

Answers (1)

David M.
David M.

Reputation: 4598

With numpy/scipy:

csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""

import io
from numpy import genfromtxt, outer
from scipy.stats.contingency import margins

observed = genfromtxt(io.StringIO(csv), delimiter=',', skip_header=True, usecols=range(1, 5))
row_sums, col_sums = margins(observed)
expected = outer(row_sums, col_sums) / observed.sum()
chi_squared_stat = ((observed - expected)**2 / expected).sum()

print(chi_squared_stat)

With pandas:

import io
import pandas as pd

csv = """\
work_group,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
df = pd.read_csv(io.StringIO(csv))

df_melt = df.melt(id_vars ='work_group', var_name='group', value_name='observed')
df_melt['col_sum'] = df_melt.groupby('group')['observed'].transform(np.sum)
df_melt['row_sum'] = df_melt.groupby('work_group')['observed'].transform(np.sum)
total = df_melt['observed'].sum()

df_melt['expected'] = df_melt.apply(lambda row: row['col_sum']*row['row_sum']/total, axis=1)
chi_squared_stat = df_melt.apply(lambda row: ((row['observed'] - row['expected'])**2) / row['expected'], axis=1).sum()

print(chi_squared_stat)

Upvotes: 1

Related Questions