RROBINSON

Reputation: 191

Python: Chi Squared for categorical values in large dataset

I have no experience of note with Python, and am trying to use it for a statistical analysis of a very large dataset (10 million cases) because the other options (SPSS and R) cannot handle the dataset on the authorized hardware.

In this dataset, there are many categorical variables (Diagnosis1, Diagnosis2...Diagnosis30) and an Event variable (the dependent variable).
Cases are listed as rows.

Something like this

Diagnosis1       Diagnosis2         Diagnosis3   Event
1                0                  0            1
0                1                  0            0 
0                1                  0            0 

....and so on

I can load the data and preview it with this:

    import pandas as pd
    import numpy as np
    NRD_Data = pd.read_csv('NRD_DL.csv')
    NRD_Data.head()

but I am stuck on how to build 2x2 tables and perform a Chi Square test on the tables.

            Diagnosis1=1   Diagnosis1=0
Event=1     100            12
Event=0     80             45

The desired result is something akin to running cross-tabs in SPSS to compare categorical values.

Upvotes: 5

Views: 2015

Answers (1)

BENY

Reputation: 323356

Use pd.crosstab to build the contingency table for each diagnosis column; you can then run your chi-square test on each table.

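    # df is the loaded DataFrame (NRD_Data in the question)
    cols = ['Diagnosis1', 'Diagnosis2', 'Diagnosis3']
    # One Event-by-diagnosis contingency table per diagnosis column
    tables = [pd.crosstab(df['Event'], df[col]) for col in cols]
    tables[0]
    Out[569]: 
    Diagnosis1  0  1
    Event           
    0           2  0
    1           0  1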
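
For the test itself, one option is scipy.stats.chi2_contingency, which accepts a crosstab directly. The sketch below is a suggestion rather than part of the original answer: SciPy is an assumed extra dependency, the toy DataFrame holds made-up values that only mirror the layout in the question, and the names `diagnosis_cols` and `results` are illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data mirroring the question's layout (made-up values, not real data).
df = pd.DataFrame({
    'Diagnosis1': [1, 0, 0, 1, 0, 1, 0, 1],
    'Diagnosis2': [0, 1, 1, 0, 1, 1, 0, 0],
    'Event':      [1, 0, 0, 1, 0, 1, 1, 0],
})

diagnosis_cols = ['Diagnosis1', 'Diagnosis2']  # extend up to Diagnosis30 as needed
rows = []
for col in diagnosis_cols:
    # 2x2 contingency table: Event (rows) by diagnosis flag (columns)
    table = pd.crosstab(df['Event'], df[col])
    # Returns the statistic, p-value, degrees of freedom and expected counts.
    # For 2x2 tables Yates' continuity correction is applied by default.
    chi2, p, dof, expected = chi2_contingency(table)
    rows.append({'variable': col, 'chi2': chi2, 'p_value': p, 'dof': dof})

# One row of test results per diagnosis column
results = pd.DataFrame(rows)
print(results)
```

Pass `correction=False` to chi2_contingency if you want the uncorrected Pearson statistic instead. Note that the function raises a ValueError when any expected count is zero, which can happen for a diagnosis column that contains only one value.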

Upvotes: 3
