Reputation: 2637
Using Pandas 1.0.1, given this demo dataframe I want to substitute the numeric columns weight and price with their median values calculated on the subgroup 'box' (without modifying the original dataframe):
import pandas as pd
import numpy as np
df = pd.DataFrame({'box': ['a100','a100','a100','a200','a200','a300','a300','a300','a300'], 'id_in_box': ['2x', '1x', '3x', '2x', '1x', '3x', '1x', '2x', '4x'], 'weight': [2, 1, 2, 3, 4, 2, 8, 6, 4.5], 'price': [1.5, 3.2, 2.1, 3.3, 1.5, 3.2, 2.1, 3.3, 4.4]})
df
box id_in_box weight price
0 a100 2x 2.0 1.5
1 a100 1x 1.0 3.2
2 a100 3x 2.0 2.1
3 a200 2x 3.0 3.3
4 a200 1x 4.0 1.5
5 a300 3x 2.0 3.2
6 a300 1x 8.0 2.1
7 a300 2x 6.0 3.3
8 a300 4x 4.5 4.4
Actual:
df.groupby('box')[['weight', 'price']].transform(lambda x: np.median(x))
weight price
0 2.00 2.10
1 2.00 2.10
2 2.00 2.10
3 3.50 2.40
4 3.50 2.40
5 5.25 3.25
6 5.25 3.25
7 5.25 3.25
8 5.25 3.25
Expected:
box id_in_box weight price
0 a100 2x 2.00 2.10
1 a100 1x 2.00 2.10
2 a100 3x 2.00 2.10
3 a200 2x 3.50 2.40
4 a200 1x 3.50 2.40
5 a300 3x 5.25 3.25
6 a300 1x 5.25 3.25
7 a300 2x 5.25 3.25
8 a300 4x 5.25 3.25
How can I achieve this in the most efficient way?
Upvotes: 5
Views: 1688
Reputation: 35115
It is a primitive approach.
gb = df.groupby(['box'])['weight'].transform(lambda x: np.median(x))
gb1 = df.groupby(['box'])['price'].transform(lambda x: np.median(x))
df['weight'] = gb
df['price'] = gb1
Upvotes: 0
Reputation: 305
You will need to combine the 2 outputs in some fashion (as mentioned in the comments). Copying works (as you suggested) but would require 2 lines of code:
df_new = df.copy()
df_new[['weight', 'price']] = df_new.groupby('box')[['weight', 'price']].transform(lambda x: np.median(x))
A single line solution without overwriting the original dataframe is via join, albeit not as efficient for large data:
df_new = pd.merge(
left=df[['box', 'id_in_box']],
right=df.groupby('box')[['weight', 'price']].median(),
left_on='box',
right_index=True
)
Upvotes: 2
Reputation: 13858
new_df = df[df.columns[:2]].merge(
df.groupby('box', as_index=False)[['weight', 'price']].median(),
how='left')
Result:
box id_in_box weight price
0 a100 2x 2.00 2.10
1 a100 1x 2.00 2.10
2 a100 3x 2.00 2.10
3 a200 2x 3.50 2.40
4 a200 1x 3.50 2.40
5 a300 3x 5.25 3.25
6 a300 1x 5.25 3.25
7 a300 2x 5.25 3.25
8 a300 4x 5.25 3.25
Upvotes: 4