Reputation: 763

How to create multiple summary statistics for each column in a grouping?

Using groupby().agg() lets you calculate summary statistics for specifically named columns. However, what if I want to calculate "min", "max" and "mean" for every column of the data frame per group? Is there a way to have pandas add a prefix to each column name automatically? I do not want to enumerate every column name inside the agg() function.

Upvotes: 0

Answers (2)

sitting_duck

Reputation: 3720

You could get there using describe():

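import pandas as pd

# df is your existing DataFrame.
# describe() computes the statistics; unstack() gives a Series indexed by (column, statistic)
df1 = pd.DataFrame(df.describe().unstack())
# join each (column, statistic) pair into a single label such as 'col1_mean'
n_label = pd.Series(['_'.join(map(str, i)) for i in df1.index.tolist()])
df1 = df1.reset_index(drop=True)
df1['label'] = n_label
# keep only the labels containing '_m', i.e. mean, min and max
print(df1[df1['label'].str.contains('_m')].reset_index(drop=True))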

         0      label
0   4.0105  col1_mean
1   0.0000   col1_min
2  12.0000   col1_max
3   3.9639  col2_mean
4   0.0000   col2_min
5  12.0000   col2_max
6   4.0256  col3_mean
7   0.0000   col3_min
8  12.0000   col3_max
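
The snippet above summarizes the whole frame rather than each group. A minimal sketch of the same describe() idea applied per group follows. The sample data and the grouping column 'col1' are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data; 'col1' is assumed to be the grouping column.
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'],
                   'col2': [1, 2, 3, 4],
                   'col3': [5, 6, 7, 8]})

# describe() on a groupby yields MultiIndex columns of (column, statistic).
desc = df.groupby('col1').describe()

# Keep only the wanted statistics by masking on the second column level.
wanted = desc.columns.get_level_values(1).isin(['min', 'max', 'mean'])
stats = desc.loc[:, wanted]

# Flatten the two-level labels into names such as 'col2_min'.
stats.columns = ['_'.join(pair) for pair in stats.columns]
print(stats)
```

Note that describe() also computes count, std and the quartiles before they are discarded, so this does more work than strictly needed.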

Upvotes: 0

Hammurabi

Reputation: 1179

You can iterate through every column and build the new names from the original column name. If you use .agg and do min and max on the same column, you only get the last operation as far as I can tell, though maybe there is a way to do that. So in this example, I do one operation at a time. Here's one way to do what you want, assuming there is a column 'col1' that you will use to line up all the groupby data:

import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'], 'col2': [1, 2, 3, 4], 'col3': [5, 6, 7, 8]})

col_list = df.columns.tolist()
col_list.remove('col1')  # the column you will use for the groupby output
dfg_all = df[['col1']].drop_duplicates()

for col in col_list:
    for op in ['min', 'max', 'mean']:
        # apply one operation at a time to one column
        dfg = df.groupby('col1', as_index=False)[col].agg(op)
        # name the result after the column and the operation, e.g. 'col2_min'
        dfg = dfg.rename(columns={col: col + '_' + op})
        dfg_all = dfg_all.merge(dfg, on='col1', how='left')

print(dfg_all)

to get

  col1  col2_min  col2_max  col2_mean  col3_min  col3_max  col3_mean
0    A         1         2        1.5         5         6        5.5
1    B         3         4        3.5         7         8        7.5
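
For comparison, passing a list of function names to .agg keeps every operation for each column, so the loop and the merges can be avoided. A minimal sketch, assuming the same sample frame and that all non-grouping columns are numeric:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'],
                   'col2': [1, 2, 3, 4],
                   'col3': [5, 6, 7, 8]})

# A list of function names is applied to every non-grouping column.
dfg = df.groupby('col1').agg(['min', 'max', 'mean'])

# The columns are now a MultiIndex such as ('col2', 'min'); flatten to 'col2_min'.
dfg.columns = [f'{col}_{op}' for col, op in dfg.columns]

# Turn the group keys back into an ordinary column.
dfg = dfg.reset_index()
print(dfg)
```

Swapping the order in the f-string (f'{op}_{col}') would give a prefix instead of a suffix.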

Upvotes: 0
