Reputation: 43
I have a df with raw survey data similar to the following with 12000 rows and forty questions. All responses are categorical
import pandas as pd
df = pd.DataFrame({'Age' : ['20-30','20-30','30-45', '20-30','30-45','20-30'],
'Gender' : ['M', 'F', 'F','F','M','F'],
'Income' : ['20-30k', '30-40k', '40k+', '40k+', '40k+', '20-30k'],
'Question1' : ['Good','Bad','OK','OK','Bad','Bad'],
'Question2' : ['Happy','Unhappy','Very_Unhappy','Very_Unhappy','Very_Unhappy','Happy']})
I want to categorize the responses to each question according to Age, Gender and Income, to produce a frequency (by %) table for each question that looks like this screenshot showing questions.
Crosstab produces too many categories ie it breaks down by income and within income, by age etc. So I'm not sure how best to go about this. I'm sure this is an easy problem but I'm new to python to any help would be appreciated
Upvotes: 4
Views: 278
Reputation: 323226
You can do melt
before crosstab
s=df.drop('Question2',1).\
melt(['Age','Gender','Income']).drop('variable',1).\
rename(columns={'value':'v1'}).melt('v1')
pd.crosstab(s.v1,[s.variable,s.value])
Out[235]:
variable Age Gender Income
value 20-30 30-45 F M 20-30k 30-40k 40k+
v1
Bad 2 1 2 1 1 1 1
Good 1 0 0 1 1 0 0
OK 1 1 2 0 0 0 2
Upvotes: 2
Reputation: 38415
As you said, using cross tab for all the columns breaks down the result by each column. You can use individual cross tabs and then concat
pd.concat([pd.crosstab(df.Question1, df.Gender), pd.crosstab(df.Question1, df.Income), pd.crosstab(df.Question1, df.Age)], axis = 1)
F M 20-30k 30-40k 40k+ 20-30 30-45
Question1
Bad 2 1 1 1 1 2 1
Good 0 1 1 0 0 1 0
OK 2 0 0 0 2 1 1
Edit: To get additional level at columns
age = pd.crosstab(df.Question1, df.Age)
age.columns = pd.MultiIndex.from_product([['Age'], age.columns])
gender = pd.crosstab(df.Question1, df.Gender)
gender.columns = pd.MultiIndex.from_product([['Gender'], gender.columns])
income = pd.crosstab(df.Question1, df.Income)
income.columns = pd.MultiIndex.from_product([['Income'], income.columns])
pd.concat([age, gender, income], axis = 1)
Age Gender Income
20-30 30-45 F M 20-30k 30-40k 40k+
Question1
Bad 2 1 2 1 1 1 1
Good 1 0 0 1 1 0 0
OK 1 1 2 0 0 0 2
Upvotes: 2