Reputation: 77424
Is there a pandas built-in way to apply two different aggregating functions f1, f2 to the same column df["returns"], without having to call agg() multiple times?
Example dataframe:
import pandas as pd
import datetime as dt
import numpy as np
np.random.seed(0)  # pd.np was removed from pandas; use numpy directly
df = pd.DataFrame({
    "date" : [dt.date(2012, x, 1) for x in range(1, 11)],
    "returns" : 0.05 * np.random.randn(10),
    "dummy" : np.repeat(1, 10)
})
The syntactically wrong, but intuitively right, way to do it would be:
# Assume `f1` and `f2` are defined for aggregating.
df.groupby("dummy").agg({"returns": f1, "returns": f2})
Obviously, a Python dict cannot hold duplicate keys (the second entry silently replaces the first). Is there any other way to express the input to agg()? Perhaps a list of tuples [(column, function)] would work better, to allow multiple functions applied to the same column? But agg() seems to accept only a dictionary.
Is there a workaround for this besides defining an auxiliary function that just applies both of the functions inside of it? (How would this work with aggregation anyway?)
Upvotes: 292
Views: 300889
Reputation: 11603
You can also use a lambda within a named aggregation (pd.NamedAgg):
df.groupby('dummy').agg(
    summed=pd.NamedAgg(column='returns', aggfunc=lambda series: sum(series.values)),
    joined=pd.NamedAgg(column='date', aggfunc=lambda series: ','.join(series.astype(str))),
)
Upvotes: 0
Reputation: 36184
As of 2022-06-20, the below is the accepted practice for aggregations:
df.groupby('dummy').agg(
Mean=('returns', np.mean),
Sum=('returns', np.sum))
See this answer for more information.
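The keyword form also accepts arbitrary callables, which covers the f1/f2 case from the question. Here is a minimal sketch, assuming pandas >= 0.25 and two hypothetical aggregators f1 and f2 (the question never defines them, so the bodies below are placeholders):

```python
import datetime as dt

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    "date": [dt.date(2012, x, 1) for x in range(1, 11)],
    "returns": 0.05 * np.random.randn(10),
    "dummy": np.repeat(1, 10),
})


# Hypothetical stand-ins for the question's f1 and f2.
def f1(series):
    # Range of the group's values.
    return series.max() - series.min()


def f2(series):
    # Fraction of strictly positive values in the group.
    return (series > 0).mean()


# Two different functions applied to the same column in one agg() call.
result = df.groupby("dummy").agg(
    spread=("returns", f1),
    share_positive=("returns", f2),
)
print(result)
```

The keywords (spread, share_positive) are illustrative output names; any valid Python identifier works.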
The material below is kept for historical versions of pandas.
You can simply pass the functions as a list:
In [20]: df.groupby("dummy").agg({"returns": [np.mean, np.sum]})
Out[20]:
           mean       sum
dummy
1      0.036901  0.369012
or as a dictionary (this renaming form was deprecated in pandas 0.20 and removed in 1.0):
In [21]: df.groupby('dummy').agg({'returns':
{'Mean': np.mean, 'Sum': np.sum}})
Out[21]:
        returns
           Mean       Sum
dummy
1      0.036901  0.369012
Upvotes: 334
Reputation: 23081
If you have multiple columns that you need to apply the same multiple aggregation functions on, the simplest way (imo) is to use a dictionary comprehension.
#setup
df = pd.DataFrame({'dummy': [0, 1, 1], 'A': range(3), 'B':range(1, 4), 'C':range(2, 5)})
# aggregation
df.groupby("dummy").agg({k: ['sum', 'mean'] for k in ['A', 'B', 'C']})
The above results in a dataframe with MultiIndex columns. If flat custom column names are desired, named aggregation is the way to go (as suggested in the other answers here).
As stated in the docs, for named aggregations the keys should be the output column names and the values should be tuples (column, aggregation function). Since there are multiple columns and multiple functions, this results in a nested structure. To flatten it into a single dictionary, you can use either collections.ChainMap() or a nested loop.
Also, if you prefer the grouper column (dummy) as a column (not the index), specify as_index=False in groupby().
from collections import ChainMap
# convert a list of dictionaries into a dictionary
dct = dict(ChainMap(*reversed([{f'{k}_total': (k, 'sum'), f'{k}_avg': (k, 'mean')} for k in ['A','B','C']])))
# {'A_total': ('A', 'sum'), 'A_avg': ('A', 'mean'), 'B_total': ('B', 'sum'), 'B_avg': ('B', 'mean'), 'C_total': ('C', 'sum'), 'C_avg': ('C', 'mean')}
# the same result obtained by a nested loop
# dct = {name: spec for k in ['A','B','C'] for name, spec in [(f'{k}_total', (k, 'sum')), (f'{k}_avg', (k, 'mean'))]}
# aggregation
df.groupby('dummy', as_index=False).agg(**dct)
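The nested-loop variant can also be written as a plain dict comprehension over columns and functions. The sketch below is one possible spelling, not the only one. The funcs mapping and its "total"/"avg" suffixes are illustrative names chosen here.

```python
import pandas as pd

df = pd.DataFrame({"dummy": [0, 1, 1], "A": range(3), "B": range(1, 4), "C": range(2, 5)})

# Output-name suffix -> aggregation; the suffixes are illustrative choices.
funcs = {"total": "sum", "avg": "mean"}

# One flat dict of output_name -> (column, aggregation), as named aggregation expects.
specs = {
    f"{col}_{suffix}": (col, func)
    for col in ["A", "B", "C"]
    for suffix, func in funcs.items()
}

out = df.groupby("dummy", as_index=False).agg(**specs)
print(out)
```

Keeping the suffix-to-function mapping in one place means adding another aggregation (say "max") is a one-line change.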
Upvotes: 2
Reputation: 402413
TL;DR: Pandas groupby.agg has a new, easier syntax for specifying (1) aggregations on multiple columns, and (2) multiple aggregations on a column. So, to do this for pandas >= 0.25, use
df.groupby('dummy').agg(Mean=('returns', 'mean'), Sum=('returns', 'sum'))
           Mean       Sum
dummy
1      0.036901  0.369012
OR
df.groupby('dummy')['returns'].agg(Mean='mean', Sum='sum')
           Mean       Sum
dummy
1      0.036901  0.369012
Pandas has changed the behavior of GroupBy.agg in favour of a more intuitive syntax for specifying named aggregations. See the 0.25 docs section on Enhancements as well as relevant GitHub issues GH18366 and GH26512.
From the documentation,
To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where
- The keywords are the output column names
- The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. Pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
You can now pass a tuple via keyword arguments. The tuples follow the format of (<colName>, <aggFunc>).
import pandas as pd
pd.__version__
# '0.25.0.dev0+840.g989f912ee'
# Setup
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
'height': [9.1, 6.0, 9.5, 34.0],
'weight': [7.9, 7.5, 9.9, 198.0]
})
df.groupby('kind').agg(
    max_height=('height', 'max'),
    min_weight=('weight', 'min'),
)
      max_height  min_weight
kind
cat          9.5         7.9
dog         34.0         7.5
Alternatively, you can use pd.NamedAgg (essentially a namedtuple), which makes things more explicit.
df.groupby('kind').agg(
max_height=pd.NamedAgg(column='height', aggfunc='max'),
min_weight=pd.NamedAgg(column='weight', aggfunc='min')
)
      max_height  min_weight
kind
cat          9.5         7.9
dog         34.0         7.5
It is even simpler for Series: just pass the aggfunc as a keyword argument.
df.groupby('kind')['height'].agg(max_height='max', min_height='min')
      max_height  min_height
kind
cat          9.5         9.1
dog         34.0         6.0
Lastly, if your column names aren't valid python identifiers, use a dictionary with unpacking:
df.groupby('kind')['height'].agg(**{'max height': 'max', ...})
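As a complete version of the unpacking idea, here is a small sketch on the kind/height frame from above. The second entry, "min height", is an illustrative addition and not something the snippet above specifies.

```python
import pandas as pd

df = pd.DataFrame({
    "kind": ["cat", "dog", "cat", "dog"],
    "height": [9.1, 6.0, 9.5, 34.0],
    "weight": [7.9, 7.5, 9.9, 198.0],
})

# Names with spaces cannot be written as keyword arguments,
# so build a dict and unpack it with ** instead.
# "min height" is an illustrative entry added for this sketch.
out = df.groupby("kind")["height"].agg(**{"max height": "max", "min height": "min"})
print(out)
```

The same unpacking works for the DataFrame form; there the dict values are (column, aggfunc) tuples rather than bare function names.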
In pandas versions leading up to 0.24, using a dictionary to specify output column names for the aggregation produces a FutureWarning:
df.groupby('dummy').agg({'returns': {'Mean': 'mean', 'Sum': 'sum'}})
# FutureWarning: using a dict with renaming is deprecated and will be removed
# in a future version
Using a dictionary for renaming columns was deprecated in v0.20. On more recent versions of pandas, the same thing can be specified more simply by passing a list of tuples. If you specify the functions this way, all functions for that column need to be given as (name, function) tuples.
df.groupby("dummy").agg({'returns': [('op1', 'sum'), ('op2', 'mean')]})
        returns
            op1       op2
dummy
1      0.328953  0.032895
Or,
df.groupby("dummy")['returns'].agg([('op1', 'sum'), ('op2', 'mean')])
            op1       op2
dummy
1      0.328953  0.032895
Upvotes: 265
Reputation: 16970
Would something like this work?
In [7]: df.groupby('dummy').returns.agg({'func1' : lambda x: x.sum(), 'func2' : lambda x: x.prod()})
Out[7]:
              func2     func1
dummy
1     -4.263768e-16 -0.188565
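On recent pandas versions the dict-renaming form on a grouped Series no longer works, but the same two lambdas can be passed as keyword arguments instead. This is a sketch under that assumption (named aggregation arrived in 0.25), with the seed and frame rebuilt here for self-containment, so the numbers will differ from the output above.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    "returns": 0.05 * np.random.randn(10),
    "dummy": np.repeat(1, 10),
})

# Each keyword becomes an output column; the value is the aggregation callable.
out = df.groupby("dummy")["returns"].agg(
    func1=lambda x: x.sum(),
    func2=lambda x: x.prod(),
)
print(out)
```

Unlike the dict form, the keyword form keeps the columns in the order they are written.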
Upvotes: 7