Reputation: 4725
I was just looking at https://en.wikipedia.org/wiki/Chi-squared_test and wanted to recreate the example "Example chi-squared test for categorical data".
I feel that the approach I've taken might have room for improvement, so was wondering how that might be done.
Here's the code:
csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
observed_workers = pd.read_csv(io.StringIO(csv), index_col=0)
col_sums = dt.apply(sum)
row_sums = dt.apply(sum, axis=1)
l = list(x[1] * (x[0] / col_sums.sum()) for x in itertools.product(row_sums, col_sums))
expected_workers = pd.DataFrame(
np.array(l).reshape((3, 4)),
columns=observed_workers.columns,
index=observed_workers.index,
)
chi_squared_stat = (
((observed_workers - expected_workers) ** 2).div(expected_workers).sum().sum()
)
This returns the correct value, but is probably ignorant of a nicer approach using some particular numpy / pandas methods.
Upvotes: 0
Views: 526
Reputation: 4598
With numpy/scipy:
csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
import io
from numpy import genfromtxt, outer
from scipy.stats.contingency import margins
observed = genfromtxt(io.StringIO(csv), delimiter=',', skip_header=True, usecols=range(1, 5))
row_sums, col_sums = margins(observed)
expected = outer(row_sums, col_sums) / observed.sum()
chi_squared_stat = ((observed - expected)**2 / expected).sum()
print(chi_squared_stat)
With pandas:
import io
import pandas as pd
csv = """\
work_group,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
df = pd.read_csv(io.StringIO(csv))
df_melt = df.melt(id_vars ='work_group', var_name='group', value_name='observed')
df_melt['col_sum'] = df_melt.groupby('group')['observed'].transform(np.sum)
df_melt['row_sum'] = df_melt.groupby('work_group')['observed'].transform(np.sum)
total = df_melt['observed'].sum()
df_melt['expected'] = df_melt.apply(lambda row: row['col_sum']*row['row_sum']/total, axis=1)
chi_squared_stat = df_melt.apply(lambda row: ((row['observed'] - row['expected'])**2) / row['expected'], axis=1).sum()
print(chi_squared_stat)
Upvotes: 1