Python: doing multiple column aggregation in pandas

Question

I have a DataFrame on which I want to do multiple column aggregations in pandas.

import pandas as pd
import numpy as np
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean})

With this code, I get the mean for lat. I would also like to find the mean for long.

I tried df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean}).long.agg({'avg_long': np.mean}) but this produces

AttributeError: 'DataFrame' object has no attribute 'long'

If I compute only avg_long, the code works as well.

df2 = df.groupby(['ser_no', 'CTRY_NM']).long.agg({'avg_long': np.mean})

In[2]: df2
Out[42]: 
                avg_long
ser_no CTRY_NM          
1      a            21.5
       b            23.0
2      a            26.0
       b            27.0
       e            24.5
3      b            28.5
       d            30.0

Is there a way to do this in one step or is this something I have to do separately and join back later?

jezrael · Accepted Answer

I think it is simpler to use GroupBy.mean:

print(df.groupby(['ser_no', 'CTRY_NM']).mean())
                 lat  long
ser_no CTRY_NM            
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0

If you need to specify which columns to aggregate:

print(df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean'}))
                 lat  long
ser_no CTRY_NM            
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0

More info in docs.
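
The question asks for output columns named avg_lat and avg_long. As a minimal sketch, assuming pandas 0.25 or later (where named aggregation is available), the renaming can be done in the same agg call. Each keyword is the output column name and each value is an (input column, aggregation) pair:

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

# Named aggregation: output_name=(input_column, aggregation)
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg(
    avg_lat=('lat', 'mean'),
    avg_long=('long', 'mean'),
)
print(df2)
```

This gives one row per (ser_no, CTRY_NM) group with both means side by side, so no separate join step is needed.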

EDIT:

If you need to rename the columns (that is, flatten the MultiIndex in the columns), you can use a list comprehension:

import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                'date':pd.date_range(pd.to_datetime('2016-02-24'),
                                     pd.to_datetime('2016-02-28'), freq='10H')})

print(df)
  CTRY_NM                date  lat  long  ser_no
0       a 2016-02-24 00:00:00    1    21       1
1       a 2016-02-24 10:00:00    2    22       1
2       b 2016-02-24 20:00:00    3    23       1
3       e 2016-02-25 06:00:00    4    24       2
4       e 2016-02-25 16:00:00    5    25       2
5       a 2016-02-26 02:00:00    6    26       2
6       b 2016-02-26 12:00:00    7    27       2
7       b 2016-02-26 22:00:00    8    28       3
8       b 2016-02-27 08:00:00    9    29       3
9       d 2016-02-27 18:00:00   10    30       3              

df2 = df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean', 'date': ['min', 'max', 'count']})
df2.columns = ['_'.join(col) for col in df2.columns]

print(df2)
                lat_mean            date_min            date_max  date_count  \
ser_no CTRY_NM                                                                 
1      a             1.5 2016-02-24 00:00:00 2016-02-24 10:00:00           2   
       b             3.0 2016-02-24 20:00:00 2016-02-24 20:00:00           1   
2      a             6.0 2016-02-26 02:00:00 2016-02-26 02:00:00           1   
       b             7.0 2016-02-26 12:00:00 2016-02-26 12:00:00           1   
       e             4.5 2016-02-25 06:00:00 2016-02-25 16:00:00           2   
3      b             8.5 2016-02-26 22:00:00 2016-02-27 08:00:00           2   
       d            10.0 2016-02-27 18:00:00 2016-02-27 18:00:00           1   

                long_mean  
ser_no CTRY_NM             
1      a             21.5  
       b             23.0  
2      a             26.0  
       b             27.0  
       e             24.5  
3      b             28.5  
       d             30.0
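
To make the flattening step concrete, here is a small sketch that prints the column labels before and after the join. It rebuilds the same data, but generates the dates with periods=10 and a pd.Timedelta step (my choice here, so the frequency string does not depend on the pandas version), and it passes the aggregations as strings:

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                   # Ten timestamps, 10 hours apart, starting 2016-02-24
                   'date': pd.date_range('2016-02-24', periods=10,
                                         freq=pd.Timedelta(hours=10))})

df2 = df.groupby(['ser_no', 'CTRY_NM']).agg(
    {'lat': 'mean', 'long': 'mean', 'date': ['min', 'max', 'count']})

# Before flattening, each column label is a (column, aggregation) tuple.
before = list(df2.columns)

# Join each tuple into a single string label such as 'date_min'.
df2.columns = ['_'.join(col) for col in df2.columns]
after = list(df2.columns)

print(before)
print(after)
```

Because one of the dict values is a list, agg returns MultiIndex columns for every aggregated column, which is why lat and long also pick up the _mean suffix after the join.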
