Pandas - Add columns to dataframe which finds the sum and mean of item sales by item id in the previous month

Question

I have a dataframe named grouped_train which lists the number of sales of each item in each store by month. Here's an example of the dataframe.

date_block_num	item_id	shop_id	item_cnt_month
0	32	0	6
0	32	5	1
0	26	1	3
0	26	18	9
1	32	46	1
1	26	50	2

There are 33 date_block_nums which correspond to different months. I'd like to add two columns which list the sum of all sales by item_id in the previous month/date_block_num as well as the mean of sales for that particular item_id from all store_ids in the previous month, any rows where date_block_num == 0 should be None.

So, using the example df above, the output would look like:

date_block_num	item_id	shop_id	item_cnt_month	item_sales_prev_month	mean_item_sales_prev_month
0	32	0	6	None	None
0	32	5	1	None	None
0	26	1	3	None	None
0	26	18	9	None	None
1	32	46	1	7	3.5
1	26	50	2	12	6

I've written some code for just the sum_item_prev_month column which I believe works and could easily change it to create the mean sales column as well, but with over 2.9 million rows in my dataframe, my code takes multiple hours to run. Admittedly, I'm not well versed with pandas, there has to be some vectorized formula I'm missing to speed up this computation. Here's the code I have so far.

sales_by_item_by_month = grouped_train.groupby(['date_block_num', 'item_id'], as_index=False).agg({'item_cnt_month' : 'sum'})

date_block_nums = list(grouped_train['date_block_num'])
item_ids = list(grouped_train['item_id'])

sales_for_item_prev_month = []

for index in range(len(item_ids)):
    if date_block_nums[index] == 0:
        sales_for_item_prev_month.append(None)
    else:
        sales = sales_by_item_by_month[(sales_by_item_by_month['item_id'] == item_ids[index]) & (sales_by_item_by_month['date_block_num'] == date_block_nums[index] - 1)]
        
        if len(sales) == 0:
            sales_for_item_prev_month.append(0)
        else:
            sales_for_item_prev_month.append(int(sales['item_cnt_month'].values))
        
grouped_train['item_sales_prev_month'] = sales_for_item_prev_month

Any advice would be much appreciated!

Henry Ecker · Accepted Answer

Assuming date_block_num are sequential.

Try calculating the sum and mean using groupby agg then increment the date_block_num by 1 to align it to the next group:

sum_means = df.groupby(['date_block_num', 'item_id']).agg(
    item_sales_prev_month=('item_cnt_month', 'sum'),
    mean_item_sales_prev_month=('item_cnt_month', 'mean')
).reset_index()
sum_means['date_block_num'] += 1

sum_means:

   date_block_num  item_id  item_sales_prev_month  mean_item_sales_prev_month
0               1       26                     12                         6.0
1               1       32                      7                         3.5
2               2       26                      2                         2.0
3               2       32                      1                         1.0

Then merge back to the original frame:

df = df.merge(sum_means, on=['date_block_num', 'item_id'], how='left')

df:

   date_block_num  item_id  shop_id  item_cnt_month  item_sales_prev_month  mean_item_sales_prev_month
0               0       32        0               6                    NaN                         NaN
1               0       32        5               1                    NaN                         NaN
2               0       26        1               3                    NaN                         NaN
3               0       26       18               9                    NaN                         NaN
4               1       32       46               1                    7.0                         3.5
5               1       26       50               2                   12.0                         6.0

Complete Code:

import pandas as pd

df = pd.DataFrame({
    'date_block_num': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1},
    'item_id': {0: 32, 1: 32, 2: 26, 3: 26, 4: 32, 5: 26},
    'shop_id': {0: 0, 1: 5, 2: 1, 3: 18, 4: 46, 5: 50},
    'item_cnt_month': {0: 6, 1: 1, 2: 3, 3: 9, 4: 1, 5: 2}
})

sum_means = df.groupby(['date_block_num', 'item_id']).agg(
    item_sales_prev_month=('item_cnt_month', 'sum'),
    mean_item_sales_prev_month=('item_cnt_month', 'mean')
).reset_index()
sum_means['date_block_num'] += 1

df = df.merge(sum_means, on=['date_block_num', 'item_id'], how='left')

Pandas - Add columns to dataframe which finds the sum and mean of item sales by item id in the previous month

Answers (1)

Related Questions