Reputation: 191
I have no experience of note with Python, and am trying to use it for a statistical analysis of a very large dataset (10 million cases) because the other options (SPSS and R) are unable to handle the dataset on the authorized hardware.
In this dataset, there are many categorical variables (Diagnosis1, Diagnosis2...Diagnosis30) and an Event variable (the dependent variable).
Cases are listed as rows.
Something like this
Diagnosis1 Diagnosis2 Diagnosis3 Event
1 0 0 1
0 1 0 0
0 1 0 0
....and so on
I can load the data and review it with this -
import pandas as pd
import numpy as np
NRD_Data = pd.read_csv('NRD_DL.csv')
NRD_Data.head()
but I am stuck on how to build 2x2 tables and perform a Chi Square test on the tables.
         Diagnosis1=1  Diagnosis1=0
Event=1  100           12
Event=0  80            45
The desired result is something akin to running cross-tabs in SPSS to compare categorical values.
Upvotes: 5
Views: 2015
Reputation: 323356
Use pd.crosstab to get the matrix you need; then you can run your Chi Square test on it.
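# df is the DataFrame loaded with pd.read_csv (NRD_Data in the question)
l = ['Diagnosis1', 'Diagnosis2', 'Diagnosis3']
d = []
for i in l:
    # one Event-by-diagnosis contingency table per column
    d.append(pd.crosstab(df['Event'], df[i]))
d[0]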
Out[569]:
Diagnosis1 0 1
Event
0 2 0
1 0 1
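For the test itself, one option is scipy.stats.chi2_contingency, which accepts the crosstab result directly. This is only a rough sketch: the DataFrame below is made-up data rebuilt from the 2x2 counts in the question (100, 12, 80, 45), not the real file, and using SciPy here is my assumption since the question does not name a stats library.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data rebuilt from the 2x2 counts in the question
# (Event=1: 100 and 12; Event=0: 80 and 45). Not the real dataset.
counts = [100, 12, 80, 45]
df = pd.DataFrame({
    'Event':      np.repeat([1, 1, 0, 0], counts),
    'Diagnosis1': np.repeat([1, 0, 1, 0], counts),
})

# Rows are Event, columns are Diagnosis1
table = pd.crosstab(df['Event'], df['Diagnosis1'])

# For 2x2 tables the default correction=True applies Yates' continuity correction
chi2, p, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p:.4g}, dof={dof}")
```

The same call should work on each table collected in the loop above, for example chi2_contingency(d[0]). Pass correction=False if you want the uncorrected Pearson statistic.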
Upvotes: 3