chi-square test for multiple features in Pandas

Question

I have a sample dataframe like this

import pandas as pd

m_list = ['male','male','female','female']
whiskey_list = ['alcohol','no_alcohol','alcohol','no_alcohol']
f1 = [273,62,60,7]
f2 = [276,61,57,8]
l = [m_list,whiskey_list,f1,f2]
test_df = pd.DataFrame(l).T
test_df.columns = ['gender','drink_category','f1','f2']


    gender  drink_category  f1  f2
0   male    alcohol         273 276
1   male    no_alcohol      62  61
2   female  alcohol         60  57
3   female  no_alcohol      7   8

I want to test whether there is a relationship between the two categorical variables, gender and drink_category, using a chi-square test. To do this, I want to build a contingency table for each feature column (f1, f2, ..., fn) and then compute a p-value for each feature.

The example here has only two features, f1 and f2, but in general I have many.

When processing f1, my contingency table would look like this:

gender   alcohol   no_alcohol
male      273        62
female    60         7

Then I would compute the p-value for f1.

When processing f2, my contingency table would look like this:

gender   alcohol   no_alcohol
male      276        61
female    57         8

How can I compute this using the pandas and scipy libraries?

At the end, I want a dataframe where I have p-values for each feature f1 to fn.

Stef · Accepted Answer

We can use chi2_contingency from scipy.stats to get the p-values for contingency tables built with pandas' pivot function.

import pandas as pd
from scipy.stats import chi2_contingency

test_df = pd.DataFrame({'gender': ['male','male','female','female'],
                        'drink_category': ['alcohol','no_alcohol','alcohol','no_alcohol'],
                        'f1': [273,62,60,7],
                        'f2': [276,61,57,8]})

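# Give the empty Series an explicit dtype so pandas does not have to guess one
p = pd.Series(dtype=float)
for feature in [c for c in test_df.columns if c.startswith('f')]:
    # pivot builds the 2x2 contingency table; pass keyword arguments,
    # since pandas 2.0+ no longer accepts them positionally
    _, p[feature], _, _ = chi2_contingency(test_df.pivot(index='gender', columns='drink_category', values=feature))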

print(p)

Output:

f1    0.155699
f2    0.339842
dtype: float64
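
Since the question asks for a dataframe at the end, here is a minimal sketch of the same loop that collects the results into a DataFrame instead of a Series. The helper name chi2_per_feature and the result column names (chi2, p_value, dof) are my own choices for illustration, not part of pandas or scipy. Note that chi2_contingency applies Yates' continuity correction by default on 2x2 tables (correction=True); the sketch exposes that flag so you can turn it off if you want the uncorrected statistic.

```python
import pandas as pd
from scipy.stats import chi2_contingency

test_df = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'female'],
    'drink_category': ['alcohol', 'no_alcohol', 'alcohol', 'no_alcohol'],
    'f1': [273, 62, 60, 7],
    'f2': [276, 61, 57, 8],
})


def chi2_per_feature(df, features, correction=True):
    """Hypothetical helper: run one chi-square test per feature column."""
    rows = {}
    for feature in features:
        # One contingency table per feature: gender rows, drink_category columns
        table = df.pivot(index='gender', columns='drink_category', values=feature)
        chi2, p_value, dof, _expected = chi2_contingency(
            table.to_numpy(), correction=correction
        )
        rows[feature] = {'chi2': chi2, 'p_value': p_value, 'dof': dof}
    # One row per feature, one column per reported quantity
    return pd.DataFrame.from_dict(rows, orient='index')


feature_cols = [c for c in test_df.columns if c.startswith('f')]
result = chi2_per_feature(test_df, feature_cols)
print(result)
```

With the default correction=True, the p_value column should match the Series printed above. Selecting feature columns by the 'f' prefix only works because no other column name starts with 'f'; with real data you may prefer to pass an explicit list.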
