Mousa

Reputation: 3036

Apply multiple aggregations on multiple columns on a Pandas dataframe

I have a dataframe like the following:

+-----------------------+
|  id  | weight | value |
+-----------------------+
|  i1  |   1    |   0   |
|  i1  |   2    |   3   |
|  i1  |   3    |   6   |
|  i2  |   1    |   2   |
|  i2  |   2    |   2   |
|  i3  |   2    |   2   |
+-----------------------+

and I would like to do a couple of aggregations to calculate the following per id:

  - avg_val: the average of value weighted by weight (for i1 this is (1*0 + 2*3 + 3*6) / (1 + 2 + 3) = 4)
  - val_sum: the sum of value
  - val_count: the number of non-zero value entries

The expected output is like the following:

+------------------------------------------+
|  new_id  | avg_val | val_sum | val_count |
+------------------------------------------+
|    i1    |    4    |    9    |     2     |
|    i2    |    2    |    4    |     2     |
|    i3    |    2    |    2    |     1     |
+------------------------------------------+

Please note that the output id column name (new_id here) is an input parameter, which may or may not be the same as the original id column name.

I know I can achieve this with multiple approaches, but which one is recommended and fastest, given that the amount of data we are dealing with is very large?

Possible solutions that came to my mind:

  1. group and merge for every aggregation

    import pandas as pd

    in_df = pd.DataFrame({
        'id': ['i1', 'i1', 'i1', 'i2', 'i2', 'i3'],
        'weight': [1, 2, 3, 1, 2, 2],
        'value': [0, 3, 6, 2, 2, 2]
    })
    
    out_df = pd.DataFrame()
    out_df['new_id'] = in_df['id'].unique()
    
    # Weighted average of value per id; the 0.001 floor guards against a zero weight sum
    grouped_df = (
        in_df.groupby('id')
             .apply(lambda group: (group['weight'] * group['value']).sum()
                    / max(group['weight'].sum(), 0.001))
             .reset_index(name='avg_val')
    )
    out_df = pd.merge(out_df,
                      grouped_df,
                      left_on='new_id',
                      right_on='id',
                      how='left')
    out_df = out_df.drop(columns='id')  # Dangerous if the `new_id` name is the same as `id` (see the sketch after this list)
    
    # Go on like this for every aggregation ...
    
    print(out_df)
    
  2. group and update for every aggregation

    import pandas as pd

    in_df = pd.DataFrame({
        'id': ['i1', 'i1', 'i1', 'i2', 'i2', 'i3'],
        'weight': [1, 2, 3, 1, 2, 2],
        'value': [0, 3, 6, 2, 2, 2]
    })
    
    out_df = pd.DataFrame(columns=['new_id', 'avg_val', 'val_sum', 'val_count'])
    out_df['new_id'] = in_df['id'].unique()
    out_df = out_df.set_index('new_id')
    
    # Weighted average of value per id; the 0.001 floor guards against a zero weight sum
    grouped_df = (
        in_df.groupby('id')
             .apply(lambda group: (group['weight'] * group['value']).sum()
                    / max(group['weight'].sum(), 0.001))
             .reset_index(name='avg_val')
    )
    grouped_df = grouped_df.set_index('id')
    
    out_df.update(grouped_df)
    
    # Go on like this for every aggregation ...
    
    print(out_df)
    

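For completeness, here is how I could avoid the risky drop in approach 1 by renaming the group key to the target name before merging; a minimal sketch, assuming the configurable output name is held in a variable new_id_col (a name I made up for this sketch):

    import pandas as pd

    new_id_col = 'new_id'  # hypothetical: the configurable output name, may also be 'id'

    out_df = pd.DataFrame({new_id_col: in_df['id'].unique()})

    grouped_df = (
        in_df.groupby('id')
             .apply(lambda group: (group['weight'] * group['value']).sum()
                    / max(group['weight'].sum(), 0.001))
             .reset_index(name='avg_val')
             .rename(columns={'id': new_id_col})  # carry the target name from the start
    )

    # Merging on a single, already-renamed key leaves nothing to drop afterwards
    out_df = out_df.merge(grouped_df, on=new_id_col, how='left')
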
Upvotes: 0

Views: 62

Answers (1)

harpan

Reputation: 8631

You need:

import numpy as np

res = df.assign(wv = df['weight'].mul(df['value'])).groupby('id').agg({
    'wv': 'sum',
    'weight': 'sum',
    'value':['sum', np.count_nonzero]
})
res['avg_val'] = res['wv'] / res['weight']
res = res.drop(['wv', 'weight'], axis=1)
res.columns = ['val_sum', 'val_count', 'avg_val']

Output:

        val_sum  val_count  avg_val
  id                                 
  i1          9          2      4.0
  i2          4          2      2.0
  i3          2          1      2.0

Both of your approaches use apply + lambda, which loops over each group in Python (among other overheads), while the code above avoids that entirely. That is why it is faster and more efficient.
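
If you are on pandas 0.25 or newer, named aggregation can give the final column names directly and skip the manual column flattening; a minimal sketch of the same idea (the wv_sum/w_sum helper names and the new_id output name are my own choices, not required by the API):

    import numpy as np

    res = (
        df.assign(wv=df['weight'] * df['value'])
          .groupby('id')
          .agg(
              wv_sum=('wv', 'sum'),
              w_sum=('weight', 'sum'),
              val_sum=('value', 'sum'),
              val_count=('value', np.count_nonzero),
          )
    )
    res['avg_val'] = res['wv_sum'] / res['w_sum']
    res = res.drop(columns=['wv_sum', 'w_sum']).reset_index().rename(columns={'id': 'new_id'})

This produces new_id, val_sum, val_count and avg_val in a single aggregation pass.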

Upvotes: 2
