Reputation: 904
Using the following test data frame containing binary 0/1 variables:
import pandas as pd
test_df = pd.DataFrame([
[0, 0, 0, 1],
[1, 0, 1, 1],
[0, 0, 0, 1],
[1, 0, 1, 0],
[0, 0, 0, 0],
[1, 0, 1, 0]], columns=["y", "age_catg", "race_catg", "sex_catg"])
I'd like to use the pd.crosstab() function to create two-way tables of y vs. age_catg, race_catg, and sex_catg in order to check for complete separation of y values among the predictor categories.
My actual data frame contains several thousand predictors, so rather than explicitly naming the age, race, and sex predictors I'd prefer to use column numbers. However, I'm still confused by row and column references in Python. For example, the following code doesn't work:
desc_tab = pd.crosstab(test_df[:,1], test_df[:,2:4])
desc_tab
Upvotes: 1
Views: 389
Reputation: 3001
To select columns by integer position you need the iloc indexer:
pd.crosstab(test_df.iloc[:, 1], test_df.iloc[:, 2])
Output:
race_catg  0  1
age_catg
0          3  3
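To make the indexing difference concrete, here is a minimal sketch using the question's test_df. It assumes nothing beyond pandas itself; the variable names (plain_brackets_failed, second_col, last_two) are made up for illustration.

```python
import pandas as pd

test_df = pd.DataFrame(
    [[0, 0, 0, 1],
     [1, 0, 1, 1],
     [0, 0, 0, 1],
     [1, 0, 1, 0],
     [0, 0, 0, 0],
     [1, 0, 1, 0]],
    columns=["y", "age_catg", "race_catg", "sex_catg"],
)

# Plain [] on a DataFrame looks up column labels, so a NumPy-style
# [rows, cols] key fails (the exact exception type depends on the
# pandas version, hence the broad except).
try:
    test_df[:, 1]
    plain_brackets_failed = False
except Exception:
    plain_brackets_failed = True

# .iloc takes [rows, columns] by integer position, counting from 0.
second_col = test_df.iloc[:, 1]   # same data as test_df["age_catg"]
last_two = test_df.iloc[:, 2:4]   # positions 2 and 3; the stop (4) is excluded

print(plain_brackets_failed)
print(second_col.name)
print(list(last_two.columns))
```

Note that position 1 is age_catg, not y: positions start at 0, so y is test_df.iloc[:, 0].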
You can pass several arrays/series to either columns or rows if you put them in a list:
pd.crosstab(test_df.iloc[:, 1], [test_df.iloc[:, 2], test_df.iloc[:, 3]])
Output:
race_catg  0     1
sex_catg   0  1  0  1
age_catg
0          1  2  2  1
If you want to define the columns in bulk by their positions (note that list is a built-in name in Python rather than a reserved word, but shadowing it is still best avoided, so the variable here is called cols):
cols = [test_df.iloc[:, i] for i in [2, 3]]
pd.crosstab(test_df.iloc[:, 1], cols)
Output:
race_catg  0     1
sex_catg   0  1  0  1
age_catg
0          1  2  2  1
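For the separation check described in the question, one two-way table of y against each predictor may be easier to scan than a single combined table. The following is a rough sketch, assuming y sits at position 0 and every other column is a predictor. The names tables and flagged are hypothetical, and "a zero cell in the table" is used here only as a simple stand-in for a separation check.

```python
import pandas as pd

test_df = pd.DataFrame(
    [[0, 0, 0, 1],
     [1, 0, 1, 1],
     [0, 0, 0, 1],
     [1, 0, 1, 0],
     [0, 0, 0, 0],
     [1, 0, 1, 0]],
    columns=["y", "age_catg", "race_catg", "sex_catg"],
)

# Assumption: the outcome is the first column, predictors are the rest.
y = test_df.iloc[:, 0]
predictor_positions = range(1, test_df.shape[1])

# Build one y-by-predictor table per column position.
tables = {}
for pos in predictor_positions:
    predictor = test_df.iloc[:, pos]
    tables[predictor.name] = pd.crosstab(y, predictor)

# Flag predictors whose table has an empty cell, i.e. some category
# in which only one value of y was observed.
flagged = [name for name, tab in tables.items()
           if (tab.to_numpy() == 0).any()]

print(flagged)  # ['race_catg']
```

A constant predictor such as age_catg produces a single-column table with no zero cells, so it is not flagged by this rule and would need a separate check.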
Upvotes: 2