jtanman

Reputation: 684

Groupby Apply Custom Function Pandas

I'm trying to apply a custom function in pandas, similar to the group_by and mutate functionality in dplyr.

What I'm trying to do is, given a pandas DataFrame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'category1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'category2': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'var1': np.random.randint(0, 100, 6),
                   'var2': np.random.randint(0, 100, 6)})

df
  category1 category2  var1  var2
0         a         a    23    59
1         a         b    54    20
2         a         a    48    62
3         b         b    45    76
4         b         a    60    26
5         b         b    13    70

apply some function that returns the same number of elements as there are rows in each group:

def myfunc(s):
  return [np.mean(s)] * len(s)

to get this result

df
  category1 category2  var1  var2  var3
0         a         a    23    59  35.5
1         a         b    54    20  54.0
2         a         a    48    62  35.5
3         b         b    45    76  29.0
4         b         a    60    26  60.0
5         b         b    13    70  29.0

I was thinking of something along the lines of:

df['var3'] = df.groupby(['category1', 'category2'], group_keys=False).apply(lambda x: myfunc(x.var1))

but haven't been able to get the index to match.
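For reference, inspecting what that call actually returns makes the mismatch visible. A minimal sketch, using the df and myfunc above:

# The attempted apply returns one value (a list) per group, in a Series
# indexed by the group keys (category1, category2) rather than by df's
# original 0..5 row index, so it can't be assigned straight into a column.
out = df.groupby(['category1', 'category2'], group_keys=False).apply(lambda x: myfunc(x.var1))
print(out.index)   # MultiIndex over the four (category1, category2) pairs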

In R with dplyr this would be

df <- df %>%
  group_by(category1, category2) %>%
  mutate(
    var3 = myfunc(var1)
  )

So I was able to solve it by using a custom function like:

def myfunc_data(data):
  data['var3'] = myfunc(data.var1)
  return data

and

df = df.groupby(['category1', 'category2']).apply(myfunc_data)

but I guess I was still wondering if there's a way to do it without defining this custom function.

Upvotes: 18

Views: 29418

Answers (4)

Nathan

Reputation: 10306

You can technically achieve this using apply, which I'll add here for completeness, but I would recommend using the transform method – it's simpler and faster.

The problem you had was that you returned multiple values which, when using apply, gives you a list for each group. Instead, you can return just a single value per group and rely on pandas to be smart about matching up those values (and duplicating where necessary) when you add the new column. To make this work, though, the DataFrame needs the same index as the Series returned from groupby/apply. Here's how you could do this (note the modification to myfunc too):

import numpy as np
import pandas as pd

def myfunc(s):
    return np.mean(s)

df = pd.DataFrame({'category1':['a','a','a', 'b', 'b','b'],
  'category2':['a', 'b', 'a', 'b', 'a', 'b'],
  'var1':np.random.randint(0,100,6),
  'var2':np.random.randint(0,100,6)}
)

df = (df.set_index(["category1", "category2"])
         .assign(var3=df.groupby(["category1", "category2"]).var1.apply(myfunc))
         .reset_index()
      )
df

Upvotes: 1

Panwen Wang

Reputation: 3835

It's easy to replicate this in Python using datar:

>>> from datar.all import f, tibble, sample, mean, group_by, mutate
>>> from pipda import register_func
>>> 
>>> df = tibble(
...   category1=['a','a','a', 'b', 'b','b'],
...   category2=['a', 'b', 'a', 'b', 'a', 'b'],
...   # var1=sample(100, 6),
...   # var2=sample(100, 6)
...   var1=[23, 54, 48, 45, 60, 13],
...   var2=[59, 20, 62, 76, 26, 70]
... )
>>> df
  category1 category2    var1    var2
   <object>  <object> <int64> <int64>
0         a         a      23      59
1         a         b      54      20
2         a         a      48      62
3         b         b      45      76
4         b         a      60      26
5         b         b      13      70
>>>
>>> @register_func(None)
... def myfunc(s):
...     return mean(s)
>>>
>>> df >> group_by(
...     f.category1, f.category2
... ) >> mutate(
...     var3 = myfunc(f.var1)
... )
  category1 category2    var1    var2      var3
   <object>  <object> <int64> <int64> <float64>
0         a         a      23      59      35.5
1         a         b      54      20      54.0
2         a         a      48      62      35.5
3         b         b      45      76      29.0
4         b         a      60      26      60.0
5         b         b      13      70      29.0

[Groups: category1, category2 (n=4)]

Disclaimer: I am the author of the datar package.

Upvotes: 1

jezrael

Reputation: 863166

Use GroupBy.transform to return a Series with the same size as the original DataFrame, which can then be assigned to a new column:

import numpy as np
import pandas as pd

np.random.seed(123)

# the OP's function: one group mean repeated for every row of the group
def myfunc(s):
    return [np.mean(s)] * len(s)

df = pd.DataFrame({'category1':['a','a','a', 'b', 'b','b'],
  'category2':['a', 'b', 'a', 'b', 'a', 'b'],
  'var1':np.random.randint(0,100,6),
  'var2':np.random.randint(0,100,6)}
)

df['var3'] = df.groupby(['category1', 'category2'])['var1'].transform(myfunc)
print (df)
  category1 category2  var1  var2  var3
0         a         a    66    86    82
1         a         b    92    97    92
2         a         a    98    96    82
3         b         b    17    47    37
4         b         a    83    73    83
5         b         b    57    32    37

Alternative with lambda function:

df['var3'] = (df.groupby(['category1', 'category2'])['var1']
                .transform(lambda s: [np.mean(s)] * len(s)))
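Since the per-group value here is just the mean, the aggregation can also be named directly; a minimal equivalent sketch for this particular case:

df['var3'] = (df.groupby(['category1', 'category2'])['var1']
                .transform('mean'))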

Upvotes: 14

bubble

Reputation: 1672

Try the following solution:

df.loc[:,'var3'] = df.groupby(['category1', 'category2']).var1.transform(myfunc)
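This is the same transform idea: because transform returns a result aligned to the original row index (each group's value is broadcast back to its rows), the assignment lines up without any index juggling.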

Upvotes: 4
