RobertF

Reputation: 904

Novice Python question: How to create crosstabs across multiple predictor variables and outcome variable

Using the following test data frame containing binary 0/1 variables:

import pandas as pd

test_df = pd.DataFrame([
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0]], columns=["y", "age_catg", "race_catg", "sex_catg"])

I'd like to use the pd.crosstab() function to create two-way tables of y vs. age_catg, race_catg, sex_catg in order to check for complete separation of y values among the predictor categories.

My actual data frame contains several thousand predictors, so rather than naming the age, race, and sex predictors explicitly I'd prefer to refer to them by column number. However, I'm still confused by row and column references in Python. For example, the following code doesn't work:

desc_tab = pd.crosstab(test_df[:,1],  test_df[:,2:4])     
desc_tab

Upvotes: 1

Views: 389

Answers (1)

To select rows and columns by integer position you need the .iloc indexer:

pd.crosstab(test_df.iloc[:, 1], test_df.iloc[:, 2])

Output:

race_catg  0  1
age_catg       
0          3  3
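
As a rough illustration of why the attempt in the question fails: plain [] on a DataFrame looks up column labels, so positional selection has to go through .iloc, and a single position gives a Series while a slice gives a DataFrame. The sketch below just rebuilds the question's test_df and inspects both; nothing here goes beyond the data already posted.

```python
import pandas as pd

test_df = pd.DataFrame([
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0]], columns=["y", "age_catg", "race_catg", "sex_catg"])

# A single integer position returns a one-dimensional Series.
age = test_df.iloc[:, 1]

# A slice returns a DataFrame; the stop position (4) is excluded,
# so this picks positions 2 and 3.
block = test_df.iloc[:, 2:4]

print(type(age).__name__, age.name)
print(type(block).__name__, list(block.columns))
```

pd.crosstab takes one-dimensional inputs (or a list of them), so it is the Series form that you want to pass in.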

You can pass several arrays or Series for either the rows (the index argument) or the columns by putting them in a list:

pd.crosstab(test_df.iloc[:, 1], [test_df.iloc[:, 2], test_df.iloc[:, 3]])

race_catg  0     1
sex_catg   0  1  0  1
age_catg             
0          1  2  2  1

EDIT

If you want to define the columns in a batch by their integer positions (note that list is the name of a built-in type in Python, not a reserved word, but please still don't use it as a variable name):

cols = [test_df.iloc[:, i] for i in [2, 3]]
pd.crosstab(test_df.iloc[:, 1], cols)

Output:

race_catg  0     1   
sex_catg   0  1  0  1
age_catg             
0          1  2  2  1
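
For the original goal of tabulating y against each predictor separately, one possible approach is to loop over the column positions and build one two-way table per predictor. This is only a sketch: the names tables and flagged are made up for illustration, and treating "any zero cell" as a sign of separation is an assumption on my part, a crude first screen and not a formal test.

```python
import pandas as pd

test_df = pd.DataFrame([
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0]], columns=["y", "age_catg", "race_catg", "sex_catg"])

# Outcome is assumed to sit in position 0, predictors in positions 1 onward.
y = test_df.iloc[:, 0]

# One crosstab per predictor (rows: predictor levels, columns: y values),
# stored under the predictor's column name.
tables = {
    test_df.columns[i]: pd.crosstab(test_df.iloc[:, i], y)
    for i in range(1, test_df.shape[1])
}

# Assumed heuristic: flag a predictor if some level/outcome cell is empty.
flagged = [name for name, tab in tables.items() if (tab == 0).any().any()]

for name in flagged:
    print(tables[name], end="\n\n")
```

With several thousand predictors you would probably only want to print the flagged tables, as done here, and skip the rest.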

Upvotes: 2
