johnbaltis
johnbaltis

Reputation: 1578

pandas, apply multiple functions of multiple columns to groupby object

I want to apply multiple functions of multiple columns to a groupby object which results in a new pandas.DataFrame.

I know how to do it in seperate steps:

by_user = lasts.groupby('user')
elapsed_days = by_user.apply(lambda x: (x.elapsed_time * x.num_cores).sum() / 86400)
running_days = by_user.apply(lambda x: (x.running_time * x.num_cores).sum() / 86400)
user_df = elapsed_days.to_frame('elapsed_days').join(running_days.to_frame('running_days'))

Which results in user_df being: user_df

However I suspect that there is a better way, like:

by_user.agg({'elapsed_days': lambda x: (x.elapsed_time * x.num_cores).sum() / 86400, 
             'running_days': lambda x: (x.running_time * x.num_cores).sum() / 86400})

However, this doesn't work, because AFAIK agg() works on pandas.Series.

I did find this question and answer, but the solutions look rather ugly to me, and considering that the answer is nearly four years old, there might be a better way by now.

Upvotes: 14

Views: 10966

Answers (6)

Mysterious Otter
Mysterious Otter

Reputation: 4029

Another solid variation of the solution is to do what @MaxU did with this solution to a similar question and wrap the individual functions in a Pandas series, thus only requiring a reset_index() to return a dataframe.

First, define the functions for transformations:

def ed(group):
    return group.elapsed_time * group.num_cores).sum() / 86400

def rd(group):
    return group.running_time * group.num_cores).sum() / 86400

Wrap them up in a Series using get_stats:

def get_stats(group):
    return pd.Series({'elapsed_days': ed(group),
                      'running_days':rd(group)})

Finally:

lasts.groupby('user').apply(get_stats).reset_index()

Upvotes: 7

jrjc
jrjc

Reputation: 21873

To use the agg method on a groupby object by using data from other columns of the same dataframe you could do the following:

  1. Define your functions (lambda functions or not) that take as an input a Series, and get the data from other column(s) using the df.loc[series.index, col] syntax. With this example:

    ed = lambda x: (x * lasts.loc[x.index, "num_cores"]).sum() / 86400. 
    rd = lambda x: (x * lasts.loc[x.index, "num_cores"]).sum() / 86400.
    

    where lasts is the main DataFrame, and we access the data in the column num_cores thanks to the .loc method.

  2. Create a dictionary with these functions and the name for the newly created columns. The keys are the name of the columns on which to apply each function, and the value is another dictionary where the key is the name of the function and the value is the function.

    my_func = {"elapsed_time" : {"elapsed_day" : ed},
               "running_time" : {"running_days" : rd}}
    
  3. Groupby and aggregate:

    user_df = lasts.groupby("user").agg(my_func)
    user_df
         elapsed_time running_time
          elapsed_day running_days
    user                          
    a        3.240741     2.430556
    d       10.416667     3.819444
    s        4.629630     1.851852
    
  4. If you want to remove the old column names:

     user_df.columns = user_df.columns.droplevel(0)
     user_df
          elapsed_day  running_days
    user                           
    a        3.240741      2.430556
    d       10.416667      3.819444
    s        4.629630      1.851852
    

HTH

Upvotes: 2

jberrio
jberrio

Reputation: 1124

This agg function might be what you're looking for.

I added an example dataset and applied the operation to a copy of lasts which I named lasts_.

import pandas as pd

lasts = pd.DataFrame({'user'        :['james','james','james','john','john'],
                      'elapsed_time':[ 200000, 400000, 300000,800000,900000],
                      'running_time':[ 100000, 100000, 200000,600000,700000],
                      'num_cores'   :[      4,      4,      4,     8,     8] })

# create temporary df to add columns to, without modifying original dataframe
lasts_ = pd.Series.to_frame(lasts.loc[:,'user'])  # using 'user' column to initialize copy of new dataframe.  to_frame gives dataframe instead of series so more columns can be added below
lasts_['elapsed_days'] = lasts.loc[:,'elapsed_time'] * lasts.loc[:,'num_cores'] / 86400
lasts_['running_days'] = lasts.loc[:,'running_time'] * lasts.loc[:,'num_cores'] / 86400

# aggregate
by_user = lasts_.groupby('user').agg({'elapsed_days': 'sum', 
                                      'running_days': 'sum' })

# by_user:
# user  elapsed_days        running_days
# james 41.66666666666667   18.51851851851852
# john  157.4074074074074   120.37037037037037

If you want to keep 'user' as normal column instead of index column, use:

by_user = lasts_.groupby('user', as_index=False).agg({'elapsed_days': 'sum', 
                                                      'running_days': 'sum'})

Upvotes: 0

chthonicdaemon
chthonicdaemon

Reputation: 19760

Here is a solution which closely resembles the original idea expressed under "I suspect there is a better way".

I'll use the same testing data as the other answers:

lasts = pd.DataFrame({'user':['a','s','d','d'],
                      'elapsed_time':[40000,50000,60000,90000],
                      'running_time':[30000,20000,30000,15000],
                      'num_cores':[7,8,9,4]})

groupby.apply can accept a function which returns a dataframe and will then automatically stitch the returned dataframes together. There are two small catches in the wording below. The first is noticing that the values passed to DataFrame are in fact single-element lists instead of just numbers.

def aggfunc(group):
    """ This function mirrors the OP's idea. Note the values below are lists """
    return pd.DataFrame({'elapsed_days': [(group.elapsed_time * group.num_cores).sum() / 86400], 
                         'running_days': [(group.running_time * group.num_cores).sum() / 86400]})

user_df = lasts.groupby('user').apply(aggfunc)

Result:

        elapsed_days  running_days
user                              
a    0      3.240741      2.430556
d    0     10.416667      3.819444
s    0      4.629630      1.851852

The second is that the returned dataframe has a hierarchical index (that column of zeros), which can be flattened as shown below:

user_df.index = user_df.index.levels[0]

Result:

      elapsed_days  running_days
user                            
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852

Upvotes: 0

jeremycg
jeremycg

Reputation: 24945

In response to the bounty, we can make it more general, by using partial application, from the standard libraries functools.partial function.

import functools
import pandas as pd

#same data as other answer:
lasts = pd.DataFrame({'user':['a','s','d','d'],
                   'elapsed_time':[40000,50000,60000,90000],
                   'running_time':[30000,20000,30000,15000],
                   'num_cores':[7,8,9,4]})

#define the desired lambda as a function:
def myfunc(column, df, cores):
    return (column * df.ix[column.index][cores]).sum()/86400

#use the partial to define the function with a given column and df:
mynewfunc = functools.partial(myfunc, df = lasts, cores = 'num_cores')

#agg by the partial function
lasts.groupby('user').agg({'elapsed_time':mynewfunc, 'running_time':mynewfunc})

Which gives us:

    running_time    elapsed_time
user        
a   2.430556    3.240741
d   3.819444    10.416667
s   1.851852    4.629630

This is not super useful for the example given, but may be more useful as a general example.

Upvotes: 1

jezrael
jezrael

Reputation: 862521

I think you can avoid agg or apply and rather first multiple by mul, then div and last use groupby by index with aggregating sum:

lasts = pd.DataFrame({'user':['a','s','d','d'],
                   'elapsed_time':[40000,50000,60000,90000],
                   'running_time':[30000,20000,30000,15000],
                   'num_cores':[7,8,9,4]})

print (lasts)
   elapsed_time  num_cores  running_time user
0         40000          7         30000    a
1         50000          8         20000    s
2         60000          9         30000    d
3         90000          4         15000    d
by_user = lasts.groupby('user')
elapsed_days = by_user.apply(lambda x: (x.elapsed_time * x.num_cores).sum() / 86400)
print (elapsed_days)
running_days = by_user.apply(lambda x: (x.running_time * x.num_cores).sum() / 86400)
user_df = elapsed_days.to_frame('elapsed_days').join(running_days.to_frame('running_days'))
print (user_df)
      elapsed_days  running_days
user                            
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852
lasts = lasts.set_index('user')
print (lasts[['elapsed_time','running_time']].mul(lasts['num_cores'], axis=0)
                                             .div(86400)
                                             .groupby(level=0)
                                             .sum())
      elapsed_time  running_time
user                            
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852   

Upvotes: 5

Related Questions