Reputation: 684
I'm trying to apply a custom function in pandas similar to the groupby and mutate functionality in dplyr.
What I'm trying to do is say given a pandas dataframe like this:
df = pd.DataFrame({'category1':['a','a','a', 'b', 'b','b'],
'category2':['a', 'b', 'a', 'b', 'a', 'b'],
'var1':np.random.randint(0,100,6),
'var2':np.random.randint(0,100,6)}
)
df
category1 category2 var1 var2
0 a a 23 59
1 a b 54 20
2 a a 48 62
3 b b 45 76
4 b a 60 26
5 b b 13 70
apply some function that returns the same number of elements as the number of elements in the group by:
def myfunc(s):
return [np.mean(s)] * len(s)
to get this result
df
category1 category2 var1 var2 var3
0 a a 23 59 35.5
1 a b 54 20 54
2 a a 48 62 35.5
3 b b 45 76 29
4 b a 60 26 60
5 b b 13 70 29
I was thinking of something along the lines of:
df['var3'] = df.groupby(['category1', 'category2'], group_keys=False).apply(lambda x: myfunc(x.var1))
but haven't been able to get the index to match.
In R with dplyr this would be
df <- df %>%
group_by(category1, category2) %>%
mutate(
var3 = myfunc(var1)
)
So I was able to solve it by using a custom function like:
def myfunc_data(data):
data['var3'] = myfunc(data.var1)
return data
and
df = df.groupby(['category1', 'category2']).apply(myfunc_data)
but I guess I was still wondering if there's a way to do it without defining this custom function.
Upvotes: 18
Views: 29418
Reputation: 10306
You can technically achieve this using apply
, which I'll add here for completeness, but I would recommend using the transform
method – it's simpler and faster.
The problem you had was that you returned multiple values which, when using apply
, gives you a list for each row. Instead, you can return just a single value and rely upon pandas
to be smart about matching up those values (and duplicating where necessary) when you add in the new column. To make this work, though, we need to have the same index as the series returned from groupby
/apply
. Here's how you could do this (note the modification to myfunc
too):
import pandas as pd
def myfunc(s):
return np.mean(s)
df = pd.DataFrame({'category1':['a','a','a', 'b', 'b','b'],
'category2':['a', 'b', 'a', 'b', 'a', 'b'],
'var1':np.random.randint(0,100,6),
'var2':np.random.randint(0,100,6)}
)
df = (df.set_index(["category1", "category2"])
.assign(var3=df.groupby(["category1", "category2"]).var1.apply(myfunc))
.reset_index()
)
df
Upvotes: 1
Reputation: 3835
Easy to replicate this in python using datar
>>> from datar.all import tibble, sample, mean
>>> from pipda import register_func
>>>
>>> df = tibble(
... category1=['a','a','a', 'b', 'b','b'],
... category2=['a', 'b', 'a', 'b', 'a', 'b'],
... # var1=sample(100, 6),
... # var2=sample(100, 6)
... var1=[23, 54, 48, 45, 60, 13],
... var2=[59, 20, 62, 76, 26, 70]
... )
>>> df
category1 category2 var1 var2
<object> <object> <int64> <int64>
0 a a 23 59
1 a b 54 20
2 a a 48 62
3 b b 45 76
4 b a 60 26
5 b b 13 70
>>>
>>> @register_func(None)
>>> def myfunc(s):
... return mean(s)
>>>
>>> df >> group_by(
... f.category1, f.category2
... ) >> mutate(
... var3 = myfunc(f.var1)
... )
category1 category2 var1 var2 var3
<object> <object> <int64> <int64> <float64>
0 a a 23 59 35.5
1 a b 54 20 54.0
2 a a 48 62 35.5
3 b b 45 76 29.0
4 b a 60 26 60.0
5 b b 13 70 29.0
[Groups: category1, category2 (n=4)]
Disclaimer: I am the author of the datar
package.
Upvotes: 1
Reputation: 863166
Use GroupBy.transform
for return Series
with same size like original DataFrame
, so possible assign to new column:
np.random.seed(123)
df = pd.DataFrame({'category1':['a','a','a', 'b', 'b','b'],
'category2':['a', 'b', 'a', 'b', 'a', 'b'],
'var1':np.random.randint(0,100,6),
'var2':np.random.randint(0,100,6)}
)
df['var3'] = df.groupby(['category1', 'category2'])['var1'].transform(myfunc)
print (df)
category1 category2 var1 var2 var3
0 a a 66 86 82
1 a b 92 97 92
2 a a 98 96 82
3 b b 17 47 37
4 b a 83 73 83
5 b b 57 32 37
Alternative with lambda function
:
df['var3'] = (df.groupby(['category1', 'category2'])['var1']
.transform(lambda s: [np.mean(s)] * len(s)))
Upvotes: 14
Reputation: 1672
Try the following solution:
df.loc[:,'var3'] = df.groupby(['category1', 'category2']).var1.transform(myfunc)
Upvotes: 4