Reputation: 7733
I have a dataframe like the one shown below:
import pandas as pd

df = pd.DataFrame(
    {'sub_code': ['CSE01', 'CSE01', 'CSE01',
                  'CSE02', 'CSE03', 'CSE04',
                  'CSE05', 'CSE06'],
     'stud_level': [101, 101, 101, 101,
                    101, 101, 101, 101],
     'grade': ['STA', 'STA', 'PSA', 'STA', 'STA', 'SSA', 'PSA', 'QSA']})
I would like to get the frequency of each unique value in every categorical column of the dataframe.
I tried the below, but it is neither efficient nor elegant:
df['sub_code'].value_counts()  # need to key in each column name manually
df['grade'].value_counts()     # need to key in each column name manually
df.select_dtypes(include='object').value_counts()  # counts unique row combinations, not per-column values
As my real data has more than 200 columns and 100k rows, is there any efficient approach to do this?
I expect my output to look like what is shown below. If there is a better way to present this information, I welcome that as well; I don't know how else to capture it in a neat manner, so please do share your suggestions.
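Something like the rough, loop-based sketch below (using the df above). It illustrates the shape I am after, though I am not sure it would be efficient at 200+ columns and 100k rows:
# stack each column's value_counts into one frame, keyed by column name
counts = pd.concat(
    {col: df[col].value_counts() for col in df.select_dtypes('object').columns},
    names=['Column', 'Value']
).reset_index(name='Frequency')
print(counts)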
Upvotes: 1
Views: 1242
Reputation: 26676
Another way would be to use get_dummies:
s = pd.get_dummies(df.drop(columns=['stud_level'])).sum(0).to_frame('Freq').reset_index()
# split dummy names such as 'sub_code_CSE01' back into column and value
s = (s['index'].str.split(r'_(?=[A-Z0-9]+$)', expand=True)
      .join(s.iloc[:, -1])
      .rename(columns={0: 'Column', 1: 'Value'}))
Output:
Column Value Freq
0 sub_code CSE01 3
1 sub_code CSE02 1
2 sub_code CSE03 1
3 sub_code CSE04 1
4 sub_code CSE05 1
5 sub_code CSE06 1
6 grade PSA 2
7 grade QSA 1
8 grade SSA 1
9 grade STA 4
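One caveat (my note, not part of the snippet above): the regex split assumes the values themselves contain only uppercase letters and digits. If values may contain underscores, a sketch that avoids the ambiguity is to pick a prefix separator that cannot occur in the data:
s = pd.get_dummies(df.drop(columns=['stud_level']), prefix_sep='|').sum(0).to_frame('Freq').reset_index()
# '|' cannot appear in the data, so splitting on its first occurrence is safe
s[['Column', 'Value']] = s['index'].str.split('|', n=1, expand=True)
s = s[['Column', 'Value', 'Freq']]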
Upvotes: 1
Reputation:
Use melt and value_counts:
out = (df.select_dtypes(object)
         .melt(var_name="Column", value_name="Variable")
         .value_counts(dropna=False)
         .reset_index(name="Frequency")
         .sort_values(by=['Column', 'Frequency', 'Variable'],
                      ascending=[True, False, True])
         .reset_index(drop=True))
Output:
Column Variable Frequency
0 grade STA 4
1 grade PSA 2
2 grade QSA 1
3 grade SSA 1
4 sub_code CSE01 3
5 sub_code CSE02 1
6 sub_code CSE03 1
7 sub_code CSE04 1
8 sub_code CSE05 1
9 sub_code CSE06 1
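As an optional follow-up (a sketch building on the out frame above), each value's share within its column can be added next to the raw frequency:
# e.g. STA accounts for 4 of the 8 'grade' entries, so its Share is 0.5
out['Share'] = out['Frequency'] / out.groupby('Column')['Frequency'].transform('sum')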
Upvotes: 2