Reputation: 23
I am stuck on the following problem:
I have this dataset (dummy values)
item price units
A 10 1
B 20 5
C 5 8
D 6 3
Out of it, I calculate the cumulative sum of both via:
threshold_price = 31
threshold_units = 8
ls = []
cumsum = 0
lastreset = 0
for _, row in df.iterrows():
if cumsum + row.price <= threshold_price:
cumsum += row.price
else:
last_reset = cumsum
cumsum = row.price
ls.append(cumsum)
df['cs_price'] = ls
ls = []
cumsum = 0
lastreset = 0
for _, row in df.iterrows():
if cumsum + row.units <= threshold_units:
cumsum += row.units
else:
last_reset = cumsum
cumsum = row.units
ls.append(cumsum)
df['cs_units'] = ls
From here I get:
item price units cs_price cs_units
A 10 1 10 1
B 20 5 30 6
C 5 8 5 8
D 6 3 11 3
I want to have a new column which we can call "Store" which would be equal to the first reset of any of the two columns and incrementing for each reset.
Given above example:
item price units cs_price cs_units store
A 10 1 10 1 0
B 20 5 30 6 0
C 5 8 5 8 1 #+1 because of price
D 6 3 11 3 2 #+1 because of units
The main issue I have through this approach is that, cs_price and cs_units will not reset when a new store is attributed, which is something I need.
Ideal output should be following:
item price units cs_price cs_units store
A 10 1 10 1 0
B 20 5 30 6 0
C 5 8 5 8 1 #+1 because of price, cs_units resets
D 6 3 6 3 2 #+1 because of units, cs_price resets
E 7 4 13 7 2
F 8 2 8 2 3
G 20 2 28 4 3
H 6 3 6 3 4
My current output (which is not what I want) is:
item price units cs_price cs_units store
A 10 1 10 1 0
B 20 5 30 6 0
C 5 8 5 8 1 #+1 because of cs_price
D 6 3 11 3 1
E 7 4 18 7 1
F 8 2 26 2 2 #+1 because of cs_units but cs_price does not reset
G 20 2 20 4 2
H 6 3 26 7 2
Full code (I did not include match columns above, I hope it is still readable):
threshold_price = 31
threshold_units = 8
ls = []
cumsum = 0
lastreset = 0
for _, row in df.iterrows():
if cumsum + row.price <= threshold_price:
cumsum += row.price
else:
last_reset = cumsum
cumsum = row.price
ls.append(cumsum)
df['cs_price'] = ls
ls = []
cumsum = 0
lastreset = 0
for _, row in df.iterrows():
if cumsum + row.units <= threshold_units:
cumsum += row.units
else:
last_reset = cumsum
cumsum = row.units
ls.append(cumsum)
df['cs_units'] = ls
df['match'] = df.cs_price < df.cs_price.shift()
df["match"] = df["match"].astype(int)
df['match2'] = df.cs_units < df.cs_units.shift()
df["match2"] = df["match2"].astype(int)
df["store_prep"] = df["match"].cumsum()
df["store_prep2"] = df["match2"].cumsum()
df["store"] = df[["store_prep","store_prep2"]].max(axis=1)
Upvotes: 1
Views: 93
Reputation: 682
Because your "reset" condition is dependent on both the price
and units
columns, you should loop through both columns together. During the loop, keep track of when a reset occurs, then do a cumsum on that reset list to get your store
column.
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'item': ['A','B','C','D','E','F','G','H'],
'price': [10,20,5,6,7,8,20,6],
'units': [1,5,8,3,4,2,2,3]})
threshold_price = 31
threshold_units = 8
reset = []
list_price = []
list_units = []
cs_price = 0
cs_units = 0
for price, units in zip(df['price'].to_list(), df['units'].to_list()):
if cs_price + price > threshold_price or cs_units + units > threshold_units:
reset.append(True)
cs_price = price
cs_units = units
else:
reset.append(False)
cs_price += price
cs_units += units
list_price.append(cs_price)
list_units.append(cs_units)
df['cs_price'] = list_price
df['cs_units'] = list_units
df['store'] = np.cumsum(reset)
print(df)
item price units cs_price cs_units store
0 A 10 1 10 1 0
1 B 20 5 30 6 0
2 C 5 8 5 8 1
3 D 6 3 6 3 2
4 E 7 4 13 7 2
5 F 8 2 8 2 3
6 G 20 2 28 4 3
7 H 6 3 6 3 4
Upvotes: 1