gx.
gx.

Reputation: 129

How to efficiently aggregate the same column using different aggregate functions?

Consider the following data:

df = pd.DataFrame({"id": [1, 1, 1, 2, 2], "value": [10, 50, 90, 25, 75]})
df
   id  value
0   1     10
1   1     50
2   1     90
3   2     25
4   2     75

How can one efficiently and elegantly aggregate the column value by id considering multiple aggregate functions on the same column, for instance:

    value_min  value_max  value_mean  value_sum  value_max_diff
id
1          10         90          50        150              80
2          25         75          50        100              50

One approach is to create multiple pivot tables, one by each aggregate function (built-in or not), and then concatenate the result, such as:

def max_diff(x):
    return np.max(x) - np.min(x)
funcs = [np.min, np.max, np.mean, np.sum, max_diff]

tmp = [pd.pivot_table(df, index=["id"], values=["value"],
                      aggfunc={"value": f}).rename(columns={"value": f"value_{f.__name__}"}) for f in funcs]
pivot = pd.concat(tmp, axis=1)
pivot
    value_amin  value_amax  value_mean  value_sum  value_max_diff
id
1           10          90          50        150              80
2           25          75          50        100              50

However, it seems to me that this approach is not very scalable, considering multiple columns and multiple and even different aggregate functions per column. As Raymond Hettinger says: "There must be a better way!". So, which one would be better?

Thanks in advance!

Upvotes: 0

Views: 84

Answers (1)

Pablo C
Pablo C

Reputation: 4761

You can do it with pandas.DataFrame.groupby and with DataFrameGroupBy.aggregate:

df.groupby("id").agg(funcs)

#   value                        
#    amin amax mean  sum max_diff
#id                              
#1     10   90   50  150       80
#2     25   75   50  100       50


Solution for multiple columns and different functions:

df = pd.DataFrame({"id": [1, 1, 1, 2, 2], "value1": [10, 50, 90, 25, 75], "value2": [1, 5, 6, 3, 8]})

#   id  value1  value2
#0   1      10       1
#1   1      50       5
#2   1      90       6
#3   2      25       3
#4   2      75       8

funcs = {"value1": [np.min, np.max], "value2": [np.mean, np.sum, max_diff]}

df.groupby("id").agg(funcs)

#   value1      value2             
#     amin amax   mean sum max_diff
#id                                
#1      10   90    4.0  12        5
#2      25   75    5.5  11        5

Upvotes: 4

Related Questions