ggupta

Reputation: 707

Get the difference between max and min for a pandas groupby and calculate the average

I have one dataframe like below:

ticker  fy  fp  value   f_date  rn
MSFT    2009    0   144 2010-01-01T12:12:34 0
AAPL    2010    0   144 2010-01-01T12:12:34 0
MSFT    2009    0   48  2014-05-01T12:12:34 1
AAPL    2011    0   80  2012-01-01T12:12:34 1
GOOG    2010    0   40  2010-01-01T12:12:34 0

I want to group this data by ticker, fy, and fp, like below:

df.groupby(by=['ticker', 'fy', 'fp'])

Based on this, I want to calculate the difference between the max and min of f_date and divide it by the max of rn plus 1. For example, for the group MSFT, 2009, 0, the max date is 2014-05-01T12:12:34, the min date is 2010-01-01T12:12:34, and the max rn is 1, so I want to calculate (max(f_date) - min(f_date)) / (max(rn) + 1). That gives me the days between these two dates, which I can then map against other data for analysis.

I'm unable to move forward after the groupby.
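For reference, the sample frame shown above can be reconstructed like this (the column values are copied from the table in the question):

```python
import pandas as pd

# Rebuild the sample data from the question
df = pd.DataFrame({
    'ticker': ['MSFT', 'AAPL', 'MSFT', 'AAPL', 'GOOG'],
    'fy': [2009, 2010, 2009, 2011, 2010],
    'fp': [0, 0, 0, 0, 0],
    'value': [144, 144, 48, 80, 40],
    'f_date': ['2010-01-01T12:12:34', '2010-01-01T12:12:34',
               '2014-05-01T12:12:34', '2012-01-01T12:12:34',
               '2010-01-01T12:12:34'],
    'rn': [0, 0, 1, 1, 0],
})

# f_date is stored as strings here; parse to datetime before any date math
df['f_date'] = pd.to_datetime(df['f_date'])
```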

Upvotes: 1

Views: 1728

Answers (1)

jezrael

Reputation: 862601

For pandas 0.25+ it is possible to use named aggregation, then subtract and divide the columns:

df['f_date'] = pd.to_datetime(df['f_date'])
df = df.groupby(by=['ticker', 'fy', 'fp']).agg(min1=('f_date','min'),
                                               max1=('f_date','max'),
                                               rn=('rn', 'max'))

df['new'] = df['max1'].sub(df['min1']).div(df['rn'].add(1))
print (df)
                              min1                max1  rn               new
ticker fy   fp                                                              
AAPL   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34   0   0 days 00:00:00
       2011 0  2012-01-01 12:12:34 2012-01-01 12:12:34   1   0 days 00:00:00
GOOG   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34   0   0 days 00:00:00
MSFT   2009 0  2010-01-01 12:12:34 2014-05-01 12:12:34   1 790 days 12:00:00

Or, if necessary, convert the difference of datetimes (timedeltas) to seconds with Series.dt.total_seconds:

df['new1'] = df['max1'].sub(df['min1']).dt.total_seconds().div(df['rn'].add(1))
print (df)
                              min1                max1  rn        new1
ticker fy   fp                                                        
AAPL   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34   0         0.0
       2011 0  2012-01-01 12:12:34 2012-01-01 12:12:34   1         0.0
GOOG   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34   0         0.0
MSFT   2009 0  2010-01-01 12:12:34 2014-05-01 12:12:34   1  68299200.0
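If the goal is the number of days between the dates (as asked in the question), the seconds can be divided by 86400. A minimal sketch with an illustrative one-row frame using the same column names as above:

```python
import pandas as pd

# Illustrative grouped result for the MSFT, 2009, 0 group
g = pd.DataFrame({
    'min1': pd.to_datetime(['2010-01-01 12:12:34']),
    'max1': pd.to_datetime(['2014-05-01 12:12:34']),
    'rn': [1],
})

# Timedelta -> total seconds -> days (86400 seconds per day)
g['days'] = (g['max1'].sub(g['min1'])
                      .div(g['rn'].add(1))
                      .dt.total_seconds()
                      .div(86400))
```

This yields 790.5 for the MSFT group (1581 days between the two dates, divided by 2).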

Solution for older pandas versions:

df['f_date'] = pd.to_datetime(df['f_date'])
df = df.groupby(by=['ticker', 'fy', 'fp']).agg({'f_date':['min','max'],
                                               'rn':'max'})
df.columns = df.columns.map('_'.join)
df['new'] = df['f_date_max'].sub(df['f_date_min']).div(df['rn_max'].add(1))
print (df)
                        f_date_min          f_date_max  rn_max  \
ticker fy   fp                                                   
AAPL   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34       0   
       2011 0  2012-01-01 12:12:34 2012-01-01 12:12:34       1   
GOOG   2010 0  2010-01-01 12:12:34 2010-01-01 12:12:34       0   
MSFT   2009 0  2010-01-01 12:12:34 2014-05-01 12:12:34       1   

                             new  
ticker fy   fp                    
AAPL   2010 0    0 days 00:00:00  
       2011 0    0 days 00:00:00  
GOOG   2010 0    0 days 00:00:00  
MSFT   2009 0  790 days 12:00:00  

Last, if necessary, convert the MultiIndex to columns:

df = df.reset_index()
print (df)
  ticker    fy  fp          f_date_min          f_date_max  rn_max  \
0   AAPL  2010   0 2010-01-01 12:12:34 2010-01-01 12:12:34       0   
1   AAPL  2011   0 2012-01-01 12:12:34 2012-01-01 12:12:34       1   
2   GOOG  2010   0 2010-01-01 12:12:34 2010-01-01 12:12:34       0   
3   MSFT  2009   0 2010-01-01 12:12:34 2014-05-01 12:12:34       1   

                new  
0   0 days 00:00:00  
1   0 days 00:00:00  
2   0 days 00:00:00  
3 790 days 12:00:00  
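For completeness, the same value can also be computed in one step with GroupBy.apply, which is usually slower on large data but closer to the formula in the question (a sketch, assuming the sample data above):

```python
import pandas as pd

df = pd.DataFrame({
    'ticker': ['MSFT', 'MSFT'],
    'fy': [2009, 2009],
    'fp': [0, 0],
    'f_date': pd.to_datetime(['2010-01-01 12:12:34', '2014-05-01 12:12:34']),
    'rn': [0, 1],
})

# Per group: (max date - min date) / (max rn + 1), returned as a Timedelta
out = (df.groupby(['ticker', 'fy', 'fp'])[['f_date', 'rn']]
         .apply(lambda g: (g['f_date'].max() - g['f_date'].min())
                          / (g['rn'].max() + 1)))
```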

Upvotes: 3
