Reputation: 355
import pandas as pd
import numpy as np
df = pd.DataFrame([['A', 201901, 10, 201801, 201801],
                   ['B', 201902, 11, 201801, 201802],
                   ['B', 201903, 13, 201801, 201803],
                   ['B', 201905, 18, 201801, 201805],
                   ['A', 201906, 80, 201801, 201806],
                   ['A', 202001, 10, 201901, 201901],
                   ['A', 202002, 11, 201901, 201902],
                   ['A', 202003, 13, 201901, 201903],
                   ['A', 202004, 18, 201901, 201904],
                   ['B', 202005, 80, 201901, 201905],
                   ['A', 202006, 80, 201901, 201906],
                   ['B', 201901, 10, 201801, 201801],
                   ['A', 201902, 11, 201801, 201802],
                   ['A', 201903, 13, 201801, 201803],
                   ['A', 201905, 18, 201801, 201805],
                   ['B', 201906, 80, 201801, 201806],
                   ['B', 202001, 10, 201901, 201901],
                   ['B', 202002, 11, 201901, 201902],
                   ['B', 202003, 13, 201901, 201903],
                   ['B', 202004, 18, 201901, 201904],
                   ['A', 202005, 80, 201901, 201905],
                   ['B', 202006, 80, 201901, 201906]],
                  columns=['Store', 'yearweek', 'sales', 'Start_PY', 'PY'])
df
From the df above (note that week 201904 is missing) I would like to add a column 'Sales_PY' holding, on each line, the sum of that store's sales over the previous year's weeks, i.e. from Start_PY up to PY. Something like this:
Store | yearweek | sales | Start_PY | PY | Sales_PY |
---|---|---|---|---|---|
A | 201901 | 100 | 201801 | 201801 | NaN |
B | 201902 | 11 | 201801 | 201802 | NaN |
B | 201903 | 13 | 201801 | 201803 | NaN |
B | 201905 | 18 | 201801 | 201805 | NaN |
A | 201906 | 800 | 201801 | 201806 | NaN |
A | 202001 | 100 | 201901 | 201901 | 100.0 |
A | 202002 | 110 | 201901 | 201902 | 210.0 |
A | 202003 | 130 | 201901 | 201903 | 340.0 |
A | 202004 | 180 | 201901 | 201904 | 340.0 |
B | 202005 | 80 | 201901 | 201905 | 52.0 |
A | 202006 | 800 | 201901 | 201906 | 1320.0 |
B | 201901 | 10 | 201801 | 201801 | NaN |
A | 201902 | 110 | 201801 | 201802 | NaN |
A | 201903 | 130 | 201801 | 201803 | NaN |
A | 201905 | 180 | 201801 | 201805 | NaN |
B | 201906 | 80 | 201801 | 201806 | NaN |
B | 202001 | 10 | 201901 | 201901 | 10.0 |
B | 202002 | 11 | 201901 | 201902 | 21.0 |
B | 202003 | 13 | 201901 | 201903 | 34.0 |
B | 202004 | 18 | 201901 | 201904 | 34.0 |
A | 202005 | 800 | 201901 | 201905 | 520.0 |
B | 202006 | 80 | 201901 | 201906 | 132.0 |
I think there must be a pandas equivalent of Excel's SUMIF. I.e., Sales_PY for the last row would be the SUM of sales WHERE Store == 'B' AND yearweek >= 201901 AND yearweek <= 201906, which equals 132.
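In pandas, that single cell comes down to a boolean-mask sum; a minimal sketch against the df as coded above:
# SUMIF-style lookup for the last row: Store 'B', weeks 201901 through 201906
mask = (df['Store'] == 'B') & (df['yearweek'] >= 201901) & (df['yearweek'] <= 201906)
df.loc[mask, 'sales'].sum()  # 132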
Because I can't guarantee that my df will be sorted by store/week, and weeks are sometimes missing, I would prefer not to rely on the shift() and/or cumsum() functions.
Upvotes: 2
Views: 6859
Reputation: 28709
The dates seem to be uniform for stores A and B; we can use an inequality join to pick the relevant rows, sum the values with a groupby, and then merge back to the original dataframe. The conditional_join function from pyjanitor is helpful here for the non-equi merge: it uses a binary search instead of scanning every row, so depending on the data size it can improve performance:
# pip install pyjanitor
import janitor
import pandas as pd

dates = df.filter(like='PY').drop_duplicates()
left = df.loc[:, :"sales"]
outcome = (
    left.conditional_join(
        dates,
        ("yearweek", "Start_PY", ">="),
        ("yearweek", "PY", "<="),
        how="right",
    )
    .groupby(["Store", "Start_PY", "PY"])
    .sales.sum()
)
# join back to the original dataframe
df.merge(
    outcome.rename("Sales_PY"),
    left_on=["Store", "Start_PY", "PY"],
    right_index=True,
    how="left",
)
Store yearweek sales Start_PY PY Sales_PY
0 A 201901 100 201801 201801 NaN
1 B 201902 11 201801 201802 NaN
2 B 201903 13 201801 201803 NaN
3 B 201905 18 201801 201805 NaN
4 A 201906 800 201801 201806 NaN
5 A 202001 100 201901 201901 100.0
6 A 202002 110 201901 201902 210.0
7 A 202003 130 201901 201903 340.0
8 A 202004 180 201901 201904 340.0
9 B 202005 80 201901 201905 52.0
10 A 202006 800 201901 201906 1320.0
11 B 201901 10 201801 201801 NaN
12 A 201902 110 201801 201802 NaN
13 A 201903 130 201801 201803 NaN
14 A 201905 180 201801 201805 NaN
15 B 201906 80 201801 201806 NaN
16 B 202001 10 201901 201901 10.0
17 B 202002 11 201901 201902 21.0
18 B 202003 13 201901 201903 34.0
19 B 202004 18 201901 201904 34.0
20 A 202005 800 201901 201905 520.0
21 B 202006 80 201901 201906 132.0
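For reference, a rough plain-pandas sketch of the same idea, assuming pyjanitor is not available; it materializes the full per-store cross product instead of binary searching, so it may be slower on large data:
# equi-join on Store, keep rows inside each (Start_PY, PY) window, then aggregate
windows = df[['Store', 'Start_PY', 'PY']].drop_duplicates()
joined = windows.merge(df[['Store', 'yearweek', 'sales']], on='Store')
in_window = joined['yearweek'].between(joined['Start_PY'], joined['PY'])
sums = joined[in_window].groupby(['Store', 'Start_PY', 'PY'])['sales'].sum()
df.merge(sums.rename('Sales_PY'),
         left_on=['Store', 'Start_PY', 'PY'],
         right_index=True, how='left')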
Upvotes: 0
Reputation: 8219
Note that the df you coded up is inconsistent with the df you printed in the table; I went with the one in the table.
The below is not the most elegant, but I cannot think of a more vectorized approach given the missing weeks etc. We basically implement a row-by-row calculation that follows the SUMIF logic quite closely: the function in apply is applied to each row r, and for each row it selects the relevant subset of the original dataframe df and calculates the sum.
df['Sales_PY'] = df.apply(
    lambda r: df.loc[(df['yearweek'] >= r['Start_PY'])
                     & (df['yearweek'] <= r['PY'])
                     & (df['Store'] == r['Store']), 'sales'].sum(),
    axis=1,
)
Output:
Store yearweek sales Start_PY PY Sales_PY
-- ------- ---------- ------- ---------- ------ ----------
0 A 201901 100 201801 201801 0
1 B 201902 11 201801 201802 0
2 B 201903 13 201801 201803 0
3 B 201905 18 201801 201805 0
4 A 201906 800 201801 201806 0
5 A 202001 100 201901 201901 100
6 A 202002 110 201901 201902 210
7 A 202003 130 201901 201903 340
8 A 202004 180 201901 201904 340
9 B 202005 80 201901 201905 52
10 A 202006 800 201901 201906 1320
11 B 201901 10 201801 201801 0
12 A 201902 110 201801 201802 0
13 A 201903 130 201801 201803 0
14 A 201905 180 201801 201805 0
15 B 201906 80 201801 201806 0
16 B 202001 10 201901 201901 10
17 B 202002 11 201901 201902 21
18 B 202003 13 201901 201903 34
19 B 202004 18 201901 201904 34
20 A 202005 800 201901 201905 520
21 B 202006 80 201901 201906 132
If you want NaNs instead of 0s where you have no sales data, pass the min_count=1 parameter to the sum above: .sum(min_count=1).
Upvotes: 3
Reputation: 16162
You can group the sales by store and shift them forward one row, then group again and take the cumulative sum.
import pandas as pd
import numpy as np
df = pd.DataFrame([['A', 4, 10, 3, 1],
                   ['A', 5, 11, 4, 2],
                   ['A', 6, 13, 5, 3],
                   ['A', 7, 18, 6, 4],
                   ['B', 4, 80, 3, 1],
                   ['B', 5, 78, 4, 2],
                   ['B', 6, 71, 5, 3],
                   ['B', 7, 80, 6, 4]],
                  columns=['Store', 'week', 'sales', 'week_min_1', 'week_min_3'])
# shift one row within each store so the current week is excluded...
df['sales_last_3_weeks'] = df.groupby('Store')['sales'].shift()
# ...then take the running total of the shifted sales per store
df['sales_last_3_weeks'] = df.groupby('Store')['sales_last_3_weeks'].cumsum()
Upvotes: 1