Jrob1765
Jrob1765

Reputation: 57

Pandas - Add columns to dataframe which finds the sum and mean of item sales by item id in the previous month

I have a dataframe named grouped_train which lists the number of sales of each item in each store by month. Here's an example of the dataframe.

date_block_num item_id shop_id item_cnt_month
0 32 0 6
0 32 5 1
0 26 1 3
0 26 18 9
1 32 46 1
1 26 50 2

There are 33 date_block_nums which correspond to different months. I'd like to add two columns which list the sum of all sales by item_id in the previous month/date_block_num as well as the mean of sales for that particular item_id from all store_ids in the previous month, any rows where date_block_num == 0 should be None.

So, using the example df above, the output would look like:

date_block_num item_id shop_id item_cnt_month item_sales_prev_month mean_item_sales_prev_month
0 32 0 6 None None
0 32 5 1 None None
0 26 1 3 None None
0 26 18 9 None None
1 32 46 1 7 3.5
1 26 50 2 12 6

I've written some code for just the sum_item_prev_month column which I believe works and could easily change it to create the mean sales column as well, but with over 2.9 million rows in my dataframe, my code takes multiple hours to run. Admittedly, I'm not well versed with pandas, there has to be some vectorized formula I'm missing to speed up this computation. Here's the code I have so far.

sales_by_item_by_month = grouped_train.groupby(['date_block_num', 'item_id'], as_index=False).agg({'item_cnt_month' : 'sum'})

date_block_nums = list(grouped_train['date_block_num'])
item_ids = list(grouped_train['item_id'])

sales_for_item_prev_month = []

for index in range(len(item_ids)):
    if date_block_nums[index] == 0:
        sales_for_item_prev_month.append(None)
    else:
        sales = sales_by_item_by_month[(sales_by_item_by_month['item_id'] == item_ids[index]) & (sales_by_item_by_month['date_block_num'] == date_block_nums[index] - 1)]
        
        if len(sales) == 0:
            sales_for_item_prev_month.append(0)
        else:
            sales_for_item_prev_month.append(int(sales['item_cnt_month'].values))
        
grouped_train['item_sales_prev_month'] = sales_for_item_prev_month

Any advice would be much appreciated!

Upvotes: 1

Views: 145

Answers (1)

Henry Ecker
Henry Ecker

Reputation: 35676

Assuming date_block_num are sequential.

Try calculating the sum and mean using groupby agg then increment the date_block_num by 1 to align it to the next group:

sum_means = df.groupby(['date_block_num', 'item_id']).agg(
    item_sales_prev_month=('item_cnt_month', 'sum'),
    mean_item_sales_prev_month=('item_cnt_month', 'mean')
).reset_index()
sum_means['date_block_num'] += 1

sum_means:

   date_block_num  item_id  item_sales_prev_month  mean_item_sales_prev_month
0               1       26                     12                         6.0
1               1       32                      7                         3.5
2               2       26                      2                         2.0
3               2       32                      1                         1.0

Then merge back to the original frame:

df = df.merge(sum_means, on=['date_block_num', 'item_id'], how='left')

df:

   date_block_num  item_id  shop_id  item_cnt_month  item_sales_prev_month  mean_item_sales_prev_month
0               0       32        0               6                    NaN                         NaN
1               0       32        5               1                    NaN                         NaN
2               0       26        1               3                    NaN                         NaN
3               0       26       18               9                    NaN                         NaN
4               1       32       46               1                    7.0                         3.5
5               1       26       50               2                   12.0                         6.0

Complete Code:

import pandas as pd

df = pd.DataFrame({
    'date_block_num': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1},
    'item_id': {0: 32, 1: 32, 2: 26, 3: 26, 4: 32, 5: 26},
    'shop_id': {0: 0, 1: 5, 2: 1, 3: 18, 4: 46, 5: 50},
    'item_cnt_month': {0: 6, 1: 1, 2: 3, 3: 9, 4: 1, 5: 2}
})

sum_means = df.groupby(['date_block_num', 'item_id']).agg(
    item_sales_prev_month=('item_cnt_month', 'sum'),
    mean_item_sales_prev_month=('item_cnt_month', 'mean')
).reset_index()
sum_means['date_block_num'] += 1

df = df.merge(sum_means, on=['date_block_num', 'item_id'], how='left')

Upvotes: 1

Related Questions