Reputation: 129
Consider the following data:
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2], "value": [10, 50, 90, 25, 75]})
df
   id  value
0   1     10
1   1     50
2   1     90
3   2     25
4   2     75
How can one efficiently and elegantly aggregate the column value by id, applying multiple aggregate functions to the same column? For instance:
    value_min  value_max  value_mean  value_sum  value_max_diff
id
1          10         90          50        150              80
2          25         75          50        100              50
One approach is to create multiple pivot tables, one per aggregate function (built-in or not), and then concatenate the results, such as:
def max_diff(x):
    return np.max(x) - np.min(x)

funcs = [np.min, np.max, np.mean, np.sum, max_diff]
tmp = [pd.pivot_table(df, index=["id"], values=["value"], aggfunc={"value": f})
         .rename(columns={"value": f"value_{f.__name__}"})
       for f in funcs]
pivot = pd.concat(tmp, axis=1)
pivot
    value_amin  value_amax  value_mean  value_sum  value_max_diff
id
1           10          90          50        150              80
2           25          75          50        100              50
However, this approach does not seem very scalable once there are multiple columns, each with multiple (and possibly different) aggregate functions. As Raymond Hettinger says: "There must be a better way!" So, what would that be?
Thanks in advance!
Upvotes: 0
Views: 84
Reputation: 4761
You can do it with pandas.DataFrame.groupby and DataFrameGroupBy.aggregate:
df.groupby("id").agg(funcs)
#    value
#     amin amax mean  sum max_diff
# id
# 1     10   90   50  150       80
# 2     25   75   50  100       50
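A side note: recent pandas versions warn when raw NumPy functions are passed to agg, and they also rename them (np.min shows up as amin). If flat column names like those in the question are wanted directly, named aggregation is one way; the sketch below assumes pandas >= 0.25, where the agg keyword syntax was introduced:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2], "value": [10, 50, 90, 25, 75]})

def max_diff(x):
    # Range of the group: difference between its max and min.
    return x.max() - x.min()

# Each keyword becomes an output column name, so no renaming or
# column flattening is needed afterwards.
out = df.groupby("id").agg(
    value_min=("value", "min"),
    value_max=("value", "max"),
    value_mean=("value", "mean"),
    value_sum=("value", "sum"),
    value_max_diff=("value", max_diff),
)
print(out)
```

String names such as "min" and "mean" dispatch to pandas' optimized implementations, while the custom max_diff is applied per group as a plain Python callable.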
Solution for multiple columns and different functions:
df = pd.DataFrame({"id": [1, 1, 1, 2, 2],
                   "value1": [10, 50, 90, 25, 75],
                   "value2": [1, 5, 6, 3, 8]})
#    id  value1  value2
# 0   1      10       1
# 1   1      50       5
# 2   1      90       6
# 3   2      25       3
# 4   2      75       8
funcs = {"value1": [np.min, np.max], "value2": [np.mean, np.sum, max_diff]}
df.groupby("id").agg(funcs)
#    value1      value2
#      amin amax   mean sum max_diff
# id
# 1      10   90    4.0  12        5
# 2      25   75    5.5  11        5
Upvotes: 4