Reputation: 56
Need to create a new df with a column based on couple of formula. I could create multiple df with group and merge. But is there an efficient way for achieve?
df_1 is as below,
df_1 = pd.DataFrame([['A', '1/1/2021','SKU_1','Customer Backhaul','34,848','$-51,100'],
['A', '1/1/2021','SKU_1','FOB','75,357','$12,407,112'],
['A', '1/1/2021','SKU_1','Price','75,357','$12,407,112'],
['A', '1/1/2021','SKU_1','Vendor Freight - Delivered','40,511','$65,470'],
['B', '1/1/2021','SKU_1','Customer Backhaul','197,904','$-157,487'],
['B', '1/1/2021','SKU_1','FOB','931,866','$50,059,515'],
['B', '1/1/2021','SKU_1','Price','931,866','$62,333,500'],
['B', '1/1/2021','SKU_1','Vendor Freight - Delivered','740,355','$1,220,927']],
columns=['Group', 'Month','ID','Cost Type','Volume','Order Cost'])
Formulae for 'Value' column,
Out[df_2]
Group Month ID Cost Type Volume Cost
0 A 1/1/2021 SKU_1 Freight 75,357 $116,570
1 A 1/1/2021 SKU_1 FOB 75,357 $12,407,112
2 A 1/1/2021 SKU_1 Price 75,357 $12,434,063
3 B 1/1/2021 SKU_1 Freight 931,866 $1,378,414
4 B 1/1/2021 SKU_1 FOB 931,866 $50,059,515
5 B 1/1/2021 SKU_1 Price 931,866 $62,490,988
Upvotes: 3
Views: 620
Reputation: 41327
We can apply those formulae with groupby.apply
:
Volume
/Cost
DataFrame per groupCost
tuples and explode
themFirst convert the number strings into actual numbers (or if you're loading data with read_csv
, use the thousands
param):
df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)
Given a Group
/Month
/ID
group, return its Volume
and Cost
as a dataframe:
def formulae_df(g):
# set index to Cost Type for simpler indexing
g = g.set_index('Cost Type')
# initialize empty result df
df = pd.DataFrame(columns=['Volume', 'Cost'], index=['Freight', 'FOB', 'Price']).rename_axis('Cost Type')
# fill result df with forumlae
df['Volume'] = g.loc['FOB', 'Volume']
df.loc['Freight', 'Cost'] = abs(g.loc['Customer Backhaul', 'Order Cost']) + g.loc['Vendor Freight - Delivered', 'Order Cost']
df.loc['FOB', 'Cost'] = g.loc['FOB', 'Order Cost']
df.loc['Price', 'Cost'] = g.loc['Price', 'Order Cost'] - g.loc['Customer Backhaul', 'Order Cost']
return df
Then apply formulae_df
with groupby.apply
:
df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_df).reset_index()
# Group Month ID Cost Type Volume Cost
# 0 A 1/1/2021 SKU_1 Freight 75357 116570
# 1 A 1/1/2021 SKU_1 FOB 75357 12407112
# 2 A 1/1/2021 SKU_1 Price 75357 12458212
# 3 B 1/1/2021 SKU_1 Freight 931866 1378414
# 4 B 1/1/2021 SKU_1 FOB 931866 50059515
# 5 B 1/1/2021 SKU_1 Price 931866 62490987
explode
Since each group has one Volume
vs multiple Cost
, we can generate Cost
as a list/tuple and explode
them:
The first step is still to convert the number strings into actual numbers:
df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)
Given a Group
/Month
/ID
group, compute its Volume
(value) and Cost
(tuple):
def formulae_series(g):
# set index for easy loc access
g = g.set_index('Cost Type')
# compute formulae
volume = g.loc['FOB', 'Volume']
costs = {
'Freight': abs(g.loc['Customer Backhaul', 'Order Cost']) + g.loc['Vendor Freight - Delivered', 'Order Cost'],
'FOB': g.loc['FOB', 'Order Cost'],
'Price': g.loc['Price', 'Order Cost'] - g.loc['Customer Backhaul', 'Order Cost'],
}
# return volume as a value and costs as tuples
return pd.Series({'Cost Type': costs.keys(), 'Volume': volume, 'Cost': costs.values()})
When we apply formulae_series
with groupby.apply
, note how the Cost*
columns contain tuples:
df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_series)
# Cost Type Volume Cost
# Group Month ID
# A 1/1/2021 SKU_1 (Freight, FOB, Price) 75357 (116570, 12407112, 12458212)
# B 1/1/2021 SKU_1 (Freight, FOB, Price) 931866 (1378414, 50059515, 62490987)
So now explode
those tuples into rows:
df_2 = df_2.explode(['Cost Type', 'Cost']).reset_index()
# Group Month ID Cost Type Volume Cost
# 0 A 1/1/2021 SKU_1 Freight 75357 116570
# 1 A 1/1/2021 SKU_1 FOB 75357 12407112
# 2 A 1/1/2021 SKU_1 Price 75357 12458212
# 3 B 1/1/2021 SKU_1 Freight 931866 1378414
# 4 B 1/1/2021 SKU_1 FOB 931866 50059515
# 5 B 1/1/2021 SKU_1 Price 931866 62490987
Here are the steps recombined (including an optional conversion back to commas/dollars):
## load df_1
df_1 = pd.DataFrame([['A','1/1/2021','SKU_1','Customer Backhaul','34,848','$-51,100'],['A','1/1/2021','SKU_1','FOB','75,357','$12,407,112'],['A','1/1/2021','SKU_1','Price','75,357','$12,407,112'],['A','1/1/2021','SKU_1','Vendor Freight - Delivered','40,511','$65,470'],['B','1/1/2021','SKU_1','Customer Backhaul','197,904','$-157,487'],['B','1/1/2021','SKU_1','FOB','931,866','$50,059,515'],['B','1/1/2021','SKU_1','Price','931,866','$62,333,500'],['B','1/1/2021','SKU_1','Vendor Freight - Delivered','740,355','$1,220,927']],columns=['Group','Month','ID','Cost Type','Volume','Order Cost'])
## convert to numerics
df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)
## dataframe option
df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_df).reset_index()
## or apply formulae and explode costs
# df_2 = (df_1.groupby(['Group', 'Month', 'ID'])
# .apply(formulae_series)
# .explode(['Cost Type', 'Cost'])
# .reset_index())
## optional: revert to comma/dollar strings
df_2['Volume'] = df_2['Volume'].map('{:,}'.format)
df_2['Cost'] = df_2['Cost'].map('${:,}'.format)
Output:
Group Month ID Cost Type Volume Cost
0 A 1/1/2021 SKU_1 Freight 75,357 $116,570
1 A 1/1/2021 SKU_1 FOB 75,357 $12,407,112
2 A 1/1/2021 SKU_1 Price 75,357 $12,458,212
3 B 1/1/2021 SKU_1 Freight 931,866 $1,378,414
4 B 1/1/2021 SKU_1 FOB 931,866 $50,059,515
5 B 1/1/2021 SKU_1 Price 931,866 $62,490,987
Upvotes: 4
Reputation: 5355
I find it a bit difficult to understand when it is a calculation, and when you just want to create a text with the calculation, where the numbers come from etc.
I have provided an example below after my understanding of the issue (feel free to comment if I've misunderstood):
You could loop over each row, and then do the calculations for the value (and use the replace
function for the Formulae)
replace_dict = {"Freight":"ABS(Backhaul)+Vendor Delivered", "Price":"Price - Backhaul",...,}
df["Definition or Formulae"] = df["Cost Type"].replace(replace_dict) #Replace "Freight with "ABS(Backhaul)+Vendor" etc in a new column called "Definition and Formulae
#Do some calculations with the values in each row
def get_value(row):
cost_type = row["Cost Type"]
if cost_type == "Freight":
return row["Backhaul"].abs()+ row["Vendor Delivered"]
if cost_type== "Price":
return row["Price"] - row["Backhaul"]
.
.
return row["Value"] # If nothing special to do, just return the value
df["Value"] = df.apply(get_value, axis=1)
Upvotes: 2