Reputation: 1973
I have a sample dataframe like this:
import pandas as pd
m_list = ['male','male','female','female']
whiskey_list = ['alcohol','no_alcohol','alcohol','no_alcohol']
f1 = [273,62,60,7]
f2 = [276,61,57,8]
l = [m_list,whiskey_list,f1,f2]
test_df = pd.DataFrame(l).T
test_df.columns = ['gender','drink_category','f1','f2']
   gender drink_category   f1   f2
0    male        alcohol  273  276
1    male     no_alcohol   62   61
2  female        alcohol   60   57
3  female     no_alcohol    7    8
I want to test whether there is a relationship between the two categories, gender and drink_category, using a chi-square test. To do that, I want to build a contingency table for each feature f1, f2, ..., fn and then compute a p-value for each feature.
The example here has only two features, f1 and f2, but in general I have many.
When I am processing f1, my contingency table would look like this:
gender  alcohol  no_alcohol
male        273          62
female       60           7
Then I would compute the p-value for f1.
When I am processing f2, my contingency table would look like this:
gender  alcohol  no_alcohol
male        276          61
female       57           8
How can I compute this using the pandas and scipy libraries?
At the end, I want a dataframe with the p-values for each feature f1 to fn.
Upvotes: 2
Views: 4894
Reputation: 30639
We can use chi2_contingency from scipy.stats to get the p-values for contingency tables built with pandas' pivot function.
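import pandas as pd
from scipy.stats import chi2_contingency
test_df = pd.DataFrame({'gender': ['male','male','female','female'],
                        'drink_category': ['alcohol','no_alcohol','alcohol','no_alcohol'],
                        'f1': [273,62,60,7],
                        'f2': [276,61,57,8]})
# explicit dtype avoids the empty-Series dtype warning
p = pd.Series(dtype=float)
for feature in [c for c in test_df.columns if c.startswith('f')]:
    # keyword arguments: recent pandas versions no longer accept positional ones in pivot
    table = test_df.pivot(index='gender', columns='drink_category', values=feature)
    # chi2_contingency returns (statistic, p-value, dof, expected); keep only the p-value
    _, p[feature], _, _ = chi2_contingency(table)
print(p)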
Output:
f1 0.155699
f2 0.339842
dtype: float64
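Since the question asks for a dataframe at the end, the same loop could be wrapped roughly as below. This is only a sketch: the helper name chi2_p_values and its index, columns and prefix parameters are made up for illustration, and it assumes every feature column starts with 'f' and holds one count per gender/drink_category pair. The astype(int) is there because the frame built in the question (a list of lists, transposed) ends up with object dtype.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_p_values(df, index="gender", columns="drink_category", prefix="f"):
    """Hypothetical helper: one chi-square test per feature column."""
    feature_cols = [
        c for c in df.columns
        if c.startswith(prefix) and c not in (index, columns)
    ]
    rows = []
    for feature in feature_cols:
        # Contingency table: one row per gender, one column per drink category.
        table = df.pivot(index=index, columns=columns, values=feature).astype(int)
        chi2, p_value, dof, _expected = chi2_contingency(table)
        rows.append({"feature": feature, "chi2": chi2, "p_value": p_value, "dof": dof})
    return pd.DataFrame(rows).set_index("feature")


test_df = pd.DataFrame({
    "gender": ["male", "male", "female", "female"],
    "drink_category": ["alcohol", "no_alcohol", "alcohol", "no_alcohol"],
    "f1": [273, 62, 60, 7],
    "f2": [276, 61, 57, 8],
})

result = chi2_p_values(test_df)
print(result)
```

Note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default; pass correction=False if you want the uncorrected statistic.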
Upvotes: 1