zepaft
zepaft

Reputation: 43

How to generate a frequency table from a dataframe?

I have a df with raw survey data similar to the following with 12000 rows and forty questions. All responses are categorical

import pandas as pd

df = pd.DataFrame({'Age' : ['20-30','20-30','30-45', '20-30','30-45','20-30'],
                   'Gender' : ['M', 'F', 'F','F','M','F'],
                   'Income' : ['20-30k', '30-40k', '40k+', '40k+', '40k+', '20-30k'],
                   'Question1' : ['Good','Bad','OK','OK','Bad','Bad'],
                   'Question2' : ['Happy','Unhappy','Very_Unhappy','Very_Unhappy','Very_Unhappy','Happy']})

I want to categorize the responses to each question according to Age, Gender and Income, to produce a frequency (by %) table for each question that looks like this screenshot showing questions.

enter image description here

Crosstab produces too many categories ie it breaks down by income and within income, by age etc. So I'm not sure how best to go about this. I'm sure this is an easy problem but I'm new to python to any help would be appreciated

Upvotes: 4

Views: 278

Answers (2)

BENY
BENY

Reputation: 323226

You can do melt before crosstab

s=df.drop('Question2',1).\
     melt(['Age','Gender','Income']).drop('variable',1).\
        rename(columns={'value':'v1'}).melt('v1')

pd.crosstab(s.v1,[s.variable,s.value])
Out[235]: 
variable   Age       Gender    Income            
value    20-30 30-45      F  M 20-30k 30-40k 40k+
v1                                               
Bad          2     1      2  1      1      1    1
Good         1     0      0  1      1      0    0
OK           1     1      2  0      0      0    2

Upvotes: 2

Vaishali
Vaishali

Reputation: 38415

As you said, using cross tab for all the columns breaks down the result by each column. You can use individual cross tabs and then concat

pd.concat([pd.crosstab(df.Question1, df.Gender), pd.crosstab(df.Question1, df.Income), pd.crosstab(df.Question1, df.Age)], axis = 1)

            F   M   20-30k  30-40k  40k+    20-30   30-45
Question1                           
Bad         2   1   1       1       1       2       1
Good        0   1   1       0       0       1       0
OK          2   0   0       0       2       1       1

Edit: To get additional level at columns

age = pd.crosstab(df.Question1, df.Age)
age.columns = pd.MultiIndex.from_product([['Age'], age.columns])
gender = pd.crosstab(df.Question1, df.Gender)
gender.columns = pd.MultiIndex.from_product([['Gender'], gender.columns])
income = pd.crosstab(df.Question1, df.Income)
income.columns = pd.MultiIndex.from_product([['Income'], income.columns])
pd.concat([age, gender, income], axis = 1)


        Age             Gender  Income
        20-30   30-45   F   M   20-30k  30-40k  40k+
Question1                           
Bad     2       1       2   1   1       1       1
Good    1       0       0   1   1       0       0
OK      1       1       2   0   0       0       2

Upvotes: 2

Related Questions