cph_sto
cph_sto

Reputation: 7607

Aggregation in pandas dataframe with columns names in one row

I am using Python 3.6 and I am doing an aggregation, which I have done correctly, but the column names are not in the form I want.

df = pd.DataFrame({'ID':[1,1,2,2,2],
                   'revenue':[1,3,5,1,5],
                   'month':['2012-01-01','2012-01-01','2012-03-01','2014-01-01','2012-01-01']})
print(df)
   ID       month  revenue
0   1  2012-01-01        1
1   1  2012-01-01        3
2   2  2012-03-01        5
3   2  2014-01-01        1
4   2  2012-01-01        5

Doing the aggregation below.

df = df.groupby(['ID']).agg({'revenue':'sum','month':[('distinct_m','nunique'),('month_m','first')]}).reset_index()
print(df)
  ID revenue      month            
         sum distinct_m     month_m
0  1       4          1  2012-01-01
1  2      11          3  2012-03-01

Desired output is:

  ID revenue   distinct_m       month
0  1       4            1  2012-01-01
1  2      11            3  2012-03-01

The problem is that I am using a mixed form of expressions inside agg(). Had it been only agg('revenue':'sum'), I would have got a column named revenue in precisely the same format I wanted, as shown below:

  ID revenue 
0  1       4 
1  2      11 

But, since I am creating 2 additional columns as well, using tuple form ('distinct_m','nunique'),('month_m','first'), I get column names spread across two rows.

Is there a way to get the desired output shown above in one aggregation agg()? I want to avoid using tuple form for 'revenue':'sum'. I am not looking for multiple operations afterwards to get the column names right. I am using Python 3.6.

Upvotes: 1

Views: 1830

Answers (1)

jezrael
jezrael

Reputation: 863291

For avoid this problem is used named aggregations working in pandas 0.25+, where is possible specify each columns names:

df = (df.groupby(['ID']).agg(revenue=('revenue','sum'),
                             distinct_m=('month','nunique'),
                             month_m = ('month','first')
                            ).reset_index())
print(df)
   ID  revenue  distinct_m     month_m
0   1        4           1  2012-01-01
1   2       11           3  2012-03-01

For lower pandas versions is possible flatten columns in MultiIndex and then rename:

df = df.groupby(['ID']).agg({'revenue':'sum',
                             'month':[('distinct_m','nunique'),('month_m','first')]})
df.columns = df.columns.map('_'.join)
df = df.rename(columns={'revenue_sum':'revenue',
                        'month_distinct_m':'distinct_m',
                        'month_month_m':'month_m'})
df = df.reset_index()
print(df)
   ID  revenue  distinct_m     month_m
0   1        4           1  2012-01-01
1   2       11           3  2012-03-01

Upvotes: 3

Related Questions