dustin

Reputation: 4406

Python: doing multiple column aggregation in pandas

I have a dataframe on which I want to do multiple column aggregations in pandas.

import pandas as pd
import numpy as np
df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean})

With this code, I get the mean for lat. I would also like to find the mean for long.

I tried df2 = df.groupby(['ser_no', 'CTRY_NM']).lat.agg({'avg_lat': np.mean}).long.agg({'avg_long': np.mean}) but this produces

AttributeError: 'DataFrame' object has no attribute 'long'

If I aggregate only long, the code works as well:

df2 = df.groupby(['ser_no', 'CTRY_NM']).long.agg({'avg_long': np.mean})

In[2]: df2
Out[42]: 
                avg_long
ser_no CTRY_NM          
1      a            21.5
       b            23.0
2      a            26.0
       b            27.0
       e            24.5
3      b            28.5
       d            30.0

Is there a way to do this in one step or is this something I have to do separately and join back later?

Upvotes: 1

Views: 1292

Answers (2)

jezrael

Reputation: 862711

I think it is simpler to use GroupBy.mean:

print(df.groupby(['ser_no', 'CTRY_NM']).mean())
                 lat  long
ser_no CTRY_NM            
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0

If you need to specify which columns to aggregate:

print(df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean'}))
                 lat  long
ser_no CTRY_NM            
1      a         1.5  21.5
       b         3.0  23.0
2      a         6.0  26.0
       b         7.0  27.0
       e         4.5  24.5
3      b         8.5  28.5
       d        10.0  30.0

More info in docs.
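
If the goal is to get the avg_lat and avg_long names from the question in a single step, named aggregation may be an option. This is only a sketch and assumes pandas 0.25 or newer, which is later than the version used elsewhere in this thread; each keyword is the output column name and each tuple is (input column, function):

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

# Named aggregation: output_name=(input_column, aggregation_function)
df2 = df.groupby(['ser_no', 'CTRY_NM']).agg(
    avg_lat=('lat', 'mean'),
    avg_long=('long', 'mean'),
)
print(df2)
```

The result has flat column names, so no renaming or multiindex flattening should be needed afterwards.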

EDIT:

If you need to rename the columns, i.e. remove the multiindex in the columns, you can use a list comprehension:

import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
                'date':pd.date_range(pd.to_datetime('2016-02-24'),
                                     pd.to_datetime('2016-02-28'), freq='10H')})

print(df)
  CTRY_NM                date  lat  long  ser_no
0       a 2016-02-24 00:00:00    1    21       1
1       a 2016-02-24 10:00:00    2    22       1
2       b 2016-02-24 20:00:00    3    23       1
3       e 2016-02-25 06:00:00    4    24       2
4       e 2016-02-25 16:00:00    5    25       2
5       a 2016-02-26 02:00:00    6    26       2
6       b 2016-02-26 12:00:00    7    27       2
7       b 2016-02-26 22:00:00    8    28       3
8       b 2016-02-27 08:00:00    9    29       3
9       d 2016-02-27 18:00:00   10    30       3              

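df2 = df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': 'mean', 'date': ['min', 'max', 'count']})
# Each column label is a (column, function) tuple; join the parts into one flat name
df2.columns = ['_'.join(col) for col in df2.columns]
print(df2)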
                lat_mean            date_min            date_max  date_count  \
ser_no CTRY_NM                                                                 
1      a             1.5 2016-02-24 00:00:00 2016-02-24 10:00:00           2   
       b             3.0 2016-02-24 20:00:00 2016-02-24 20:00:00           1   
2      a             6.0 2016-02-26 02:00:00 2016-02-26 02:00:00           1   
       b             7.0 2016-02-26 12:00:00 2016-02-26 12:00:00           1   
       e             4.5 2016-02-25 06:00:00 2016-02-25 16:00:00           2   
3      b             8.5 2016-02-26 22:00:00 2016-02-27 08:00:00           2   
       d            10.0 2016-02-27 18:00:00 2016-02-27 18:00:00           1   

                long_mean  
ser_no CTRY_NM             
1      a             21.5  
       b             23.0  
2      a             26.0  
       b             27.0  
       e             24.5  
3      b             28.5  
       d             30.0  
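
To make the flattening step above easier to follow, here is a small sketch on the question's data (without the date column) that keeps the intermediate column labels around. The variable names agg, tuples_before and flat_names are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

# Passing a list of functions for one column makes the result columns a MultiIndex
agg = df.groupby(['ser_no', 'CTRY_NM']).agg({'lat': 'mean', 'long': ['min', 'max']})
was_multi = isinstance(agg.columns, pd.MultiIndex)

# Each label is a (column, function) tuple such as ('long', 'min')
tuples_before = list(agg.columns)

# Join the two parts of every tuple into a single flat name
agg.columns = ['_'.join(col) for col in agg.columns]
flat_names = list(agg.columns)

print(tuples_before)
print(flat_names)
```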

Upvotes: 2

user2285236

Reputation:

You are getting the error because you first select the lat column of the dataframe and aggregate only that column. The result contains only the aggregated lat values, so the long column cannot be reached through it; you need to select both columns from the grouped dataframe.

df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean)

would do the same operation for both columns. If you want the column names changed, you can rename the columns afterwards:

df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns={"lat": "avg_lat", "long": "avg_long"})

In [22]:
df2 = df.groupby(['ser_no', 'CTRY_NM'])[["lat", "long"]].agg(np.mean).rename(columns={"lat": "avg_lat", "long": "avg_long"})
df2
Out[22]:
                avg_lat  avg_long
ser_no CTRY_NM                   
1      a            1.5      21.5
       b            3.0      23.0
2      a            6.0      26.0
       b            7.0      27.0
       e            4.5      24.5
3      b            8.5      28.5
       d           10.0      30.0
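
The reason for the AttributeError can be checked directly. The sketch below is not the question's exact call: it assumes a recent pandas (0.25 or newer) and uses the keyword form of agg on the selected column. It shows that the intermediate result only knows about the aggregated lat values, while selecting both columns with a list keeps both. The names lat_only and both are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'ser_no': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
                   'CTRY_NM': ['a', 'a', 'b', 'e', 'e', 'a', 'b', 'b', 'b', 'd'],
                   'lat': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'long': [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]})

# Aggregating after selecting .lat yields a frame with a single avg_lat column
lat_only = df.groupby(['ser_no', 'CTRY_NM']).lat.agg(avg_lat='mean')
print(list(lat_only.columns))
print(hasattr(lat_only, 'long'))  # there is nothing named long left to chain on

# Selecting both columns with a list keeps them both in the result
both = df.groupby(['ser_no', 'CTRY_NM'])[['lat', 'long']].mean()
print(list(both.columns))
```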

Upvotes: 1
