Reputation: 3036
I have a dataframe like the following:
+-----------------------+
| id | weight | value |
+-----------------------+
| i1 | 1 | 0 |
| i1 | 2 | 3 |
| i1 | 3 | 6 |
| i2 | 1 | 2 |
| i2 | 2 | 2 |
| i3 | 2 | 2 |
+-----------------------+
and I would like to do a couple of aggregations to calculate the following per id:
avg_val: the average of value weighted by weight
val_sum: the sum of value
val_count: the count of non-zero values
The expected output is like the following:
+------------------------------------------+
| new_id | avg_val | val_sum | val_count |
+------------------------------------------+
| i1 | 4 | 9 | 2 |
| i2 | 2 | 4 | 2 |
| i3 | 2 | 2 | 1 |
+------------------------------------------+
Please note that the id column name in the output (new_id here) is an input parameter, which might differ from or coincide with the original column name.
I know I can achieve this with multiple approaches, but which is the recommended and fastest one, given that the amount of data we are dealing with is very big?
Possible solutions that came to my mind:
group and merge for every aggregation
in_df = pd.DataFrame({
'id': ['i1', 'i1', 'i1', 'i2', 'i2', 'i3'],
'weight': [1, 2, 3, 1, 2, 2],
'value': [0, 3, 6, 2, 2, 2]
})
out_df = pd.DataFrame()
out_df['new_id'] = in_df['id'].unique()
grouped_df = in_df.groupby('id').apply(lambda group: (group['weight'] * group['value']).sum() / max(group['weight'].sum(), 0.001)).reset_index(name='avg_val')
out_df = pd.merge(out_df,
grouped_df,
left_on='new_id',
right_on='id',
how='left')
out_df = out_df.drop(columns='id')  # Dangerous if the `new_id` name is the same as `id` (see the rename variant after this block)
# Go on like this for every aggregation ...
print(out_df)
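For reference, a small variant of the same merge step that sidesteps the ambiguous drop, assuming in_df and out_df are built exactly as above: the grouped key is renamed to new_id before merging, so no leftover id column has to be removed even when the new name coincides with the old one.
grouped_df = (
    in_df.groupby('id')
         .apply(lambda g: (g['weight'] * g['value']).sum() / max(g['weight'].sum(), 0.001))
         .reset_index(name='avg_val')
         .rename(columns={'id': 'new_id'})  # rename the key before merging
)
out_df = out_df.merge(grouped_df, on='new_id', how='left')
print(out_df)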
group and update for every aggregation
in_df = pd.DataFrame({
'id': ['i1', 'i1', 'i1', 'i2', 'i2', 'i3'],
'weight': [1, 2, 3, 1, 2, 2],
'value': [0, 3, 6, 2, 2, 2]
})
out_df = pd.DataFrame(columns=['new_id', 'avg_val', 'val_sum', 'val_count'])
out_df['new_id'] = in_df['id'].unique()
out_df = out_df.set_index('new_id')
grouped_df = in_df.groupby('id').apply(lambda group: (group['weight'] * group['value']).sum() / max(group['weight'].sum(), 0.001)).reset_index(name='avg_val')
grouped_df = grouped_df.set_index('id')
out_df.update(grouped_df)
# Go on like this for every aggregation ...
print(out_df)
Upvotes: 0
Views: 62
Reputation: 8631
You need:
import numpy as np

# df is the input frame (in_df above)
res = df.assign(wv = df['weight'].mul(df['value'])).groupby('id').agg({
    'wv': 'sum',          # numerator of the weighted average
    'weight': 'sum',      # denominator of the weighted average
    'value': ['sum', np.count_nonzero]
})
res['avg_val'] = res['wv'] / res['weight']
res = res.drop(['wv', 'weight'], axis=1)
res.columns = ['val_sum', 'val_count', 'avg_val']
Output:
val_sum val_count avg_val
id
i1 9 2 4.0
i2 4 2 2.0
i3 2 1 2.0
Both of your approaches use apply + lambda, which loops over each group in Python (among other things), while this code does no such thing. That is why it is faster and more efficient.
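If you are on pandas 0.25 or newer, the same vectorized idea can also be written with named aggregation, which avoids the intermediate MultiIndex columns; a minimal sketch on your sample data (the new_id rename at the end is only there to match your expected output):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['i1', 'i1', 'i1', 'i2', 'i2', 'i3'],
    'weight': [1, 2, 3, 1, 2, 2],
    'value': [0, 3, 6, 2, 2, 2]
})

out = (
    df.assign(wv=df['weight'] * df['value'])
      .groupby('id')
      .agg(wv_sum=('wv', 'sum'),            # numerator of the weighted average
           weight_sum=('weight', 'sum'),    # denominator of the weighted average
           val_sum=('value', 'sum'),
           val_count=('value', np.count_nonzero))
      .assign(avg_val=lambda d: d['wv_sum'] / d['weight_sum'])
      .drop(columns=['wv_sum', 'weight_sum'])
      .reset_index()
      .rename(columns={'id': 'new_id'})
)
print(out)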
Upvotes: 2