Mousa

Reputation: 3036

Apply multiple aggregations on multiple columns on a Pandas dataframe

I have a dataframe like the following:

+-----------------------+
|  id  | weight | value |
+-----------------------+
|  i1  |   1    |   0   |
|  i1  |   2    |   3   |
|  i1  |   3    |   6   |
|  i2  |   1    |   2   |
|  i2  |   2    |   2   |
|  i3  |   2    |   2   |
+-----------------------+

and I would like to do a couple of aggregations to calculate the following per id:

  - avg_val: the average of value weighted by weight (for i1 this is (1*0 + 2*3 + 3*6) / (1 + 2 + 3) = 4)
  - val_sum: the sum of value
  - val_count: the number of non-zero value entries

The expected output is like the following:

+------------------------------------------+
|  new_id  | avg_val | val_sum | val_count |
+------------------------------------------+
|    i1    |    4    |    9    |     2     |
|    i2    |    2    |    4    |     2     |
|    i3    |    2    |    2    |     1     |
+------------------------------------------+

Please note that the output id column name (new_id here) is an input parameter, which may or may not be the same as the original id column name.

I know I can achieve this with multiple approaches, but which one is recommended and fastest, given that the amount of data we are dealing with is very large?

Possible solutions that came to my mind:

  1. group and merge for every aggregation

    import pandas as pd

    in_df = pd.DataFrame({
        'id': ['i1', 'i1', 'i1', 'i2', 'i2', 'i3'],
        'weight': [1, 2, 3, 1, 2, 2],
        'value': [0, 3, 6, 2, 2, 2]
    })
    
    out_df = pd.DataFrame()
    out_df['new_id'] = in_df['id'].unique()
    
    # Weighted average of value per id; the 0.001 floor guards against a zero weight sum
    grouped_df = (
        in_df.groupby('id')
             .apply(lambda group: (group['weight'] * group['value']).sum()
                    / max(group['weight'].sum(), 0.001))
             .reset_index(name='avg_val')
    )
    out_df = pd.merge(out_df,
                      grouped_df,
                      left_on='new_id',
                      right_on='id',
                      how='left')
    out_df = out_df.drop(columns='id')  # Dangerous if the `new_id` name is the same as `id` (see the sketch after this list)
    
    # Go on like this for every aggregation ...
    
    print(out_df)
    
  2. group and update for every aggregation

    import pandas as pd

    in_df = pd.DataFrame({
        'id': ['i1', 'i1', 'i1', 'i2', 'i2', 'i3'],
        'weight': [1, 2, 3, 1, 2, 2],
        'value': [0, 3, 6, 2, 2, 2]
    })
    
    out_df = pd.DataFrame(columns=['new_id', 'avg_val', 'val_sum', 'val_count'])
    out_df['new_id'] = in_df['id'].unique()
    out_df = out_df.set_index('new_id')
    
    # Weighted average of value per id; the 0.001 floor guards against a zero weight sum
    grouped_df = (
        in_df.groupby('id')
             .apply(lambda group: (group['weight'] * group['value']).sum()
                    / max(group['weight'].sum(), 0.001))
             .reset_index(name='avg_val')
    )
    grouped_df = grouped_df.set_index('id')
    
    out_df.update(grouped_df)
    
    # Go on like this for every aggregation ...
    
    print(out_df)
    

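For completeness, here is how I could avoid the risky drop in approach 1 by renaming the group key to the target name before merging; a minimal sketch, assuming the configurable output name is held in a variable new_id_col (a name I made up for this sketch):

    import pandas as pd

    new_id_col = 'new_id'  # hypothetical: the configurable output name, may also be 'id'

    out_df = pd.DataFrame({new_id_col: in_df['id'].unique()})

    grouped_df = (
        in_df.groupby('id')
             .apply(lambda group: (group['weight'] * group['value']).sum()
                    / max(group['weight'].sum(), 0.001))
             .reset_index(name='avg_val')
             .rename(columns={'id': new_id_col})  # carry the target name from the start
    )

    # Merging on a single, already-renamed key leaves nothing to drop afterwards
    out_df = out_df.merge(grouped_df, on=new_id_col, how='left')
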
Upvotes: 0

Views: 62

Answers (1)

harpan

Reputation: 8631

You need:

import numpy as np

res = df.assign(wv = df['weight'].mul(df['value'])).groupby('id').agg({
    'wv': 'sum',
    'weight': 'sum',
    'value':['sum', np.count_nonzero]
})
res['avg_val'] = res['wv'] / res['weight']
res = res.drop(['wv', 'weight'], axis=1)
res.columns = ['val_sum', 'val_count', 'avg_val']

Output:

        val_sum  val_count  avg_val
  id                                 
  i1          9          2      4.0
  i2          4          2      2.0
  i3          2          1      2.0

Both of your approaches use apply + lambda, which loops over each group in Python (among other overheads), while the code above avoids that entirely. That is why it is faster and more efficient.
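
If you are on pandas 0.25 or newer, named aggregation can give the final column names directly and skip the manual column flattening; a minimal sketch of the same idea (the wv_sum/w_sum helper names and the new_id output name are my own choices, not required by the API):

    import numpy as np

    res = (
        df.assign(wv=df['weight'] * df['value'])
          .groupby('id')
          .agg(
              wv_sum=('wv', 'sum'),
              w_sum=('weight', 'sum'),
              val_sum=('value', 'sum'),
              val_count=('value', np.count_nonzero),
          )
    )
    res['avg_val'] = res['wv_sum'] / res['w_sum']
    res = res.drop(columns=['wv_sum', 'w_sum']).reset_index().rename(columns={'id': 'new_id'})

This produces new_id, val_sum, val_count and avg_val in a single aggregation pass.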

Upvotes: 2
