Reputation: 57
I have a dataframe named grouped_train which lists the number of sales of each item in each store by month. Here's an example of the dataframe.
date_block_num | item_id | shop_id | item_cnt_month |
---|---|---|---|
0 | 32 | 0 | 6 |
0 | 32 | 5 | 1 |
0 | 26 | 1 | 3 |
0 | 26 | 18 | 9 |
1 | 32 | 46 | 1 |
1 | 26 | 50 | 2 |
There are 33 date_block_nums which correspond to different months. I'd like to add two columns which list the sum of all sales by item_id in the previous month/date_block_num as well as the mean of sales for that particular item_id from all store_ids in the previous month, any rows where date_block_num == 0 should be None.
So, using the example df above, the output would look like:
date_block_num | item_id | shop_id | item_cnt_month | item_sales_prev_month | mean_item_sales_prev_month |
---|---|---|---|---|---|
0 | 32 | 0 | 6 | None | None |
0 | 32 | 5 | 1 | None | None |
0 | 26 | 1 | 3 | None | None |
0 | 26 | 18 | 9 | None | None |
1 | 32 | 46 | 1 | 7 | 3.5 |
1 | 26 | 50 | 2 | 12 | 6 |
I've written some code for just the sum_item_prev_month column which I believe works and could easily change it to create the mean sales column as well, but with over 2.9 million rows in my dataframe, my code takes multiple hours to run. Admittedly, I'm not well versed with pandas, there has to be some vectorized formula I'm missing to speed up this computation. Here's the code I have so far.
sales_by_item_by_month = grouped_train.groupby(['date_block_num', 'item_id'], as_index=False).agg({'item_cnt_month' : 'sum'})
date_block_nums = list(grouped_train['date_block_num'])
item_ids = list(grouped_train['item_id'])
sales_for_item_prev_month = []
for index in range(len(item_ids)):
if date_block_nums[index] == 0:
sales_for_item_prev_month.append(None)
else:
sales = sales_by_item_by_month[(sales_by_item_by_month['item_id'] == item_ids[index]) & (sales_by_item_by_month['date_block_num'] == date_block_nums[index] - 1)]
if len(sales) == 0:
sales_for_item_prev_month.append(0)
else:
sales_for_item_prev_month.append(int(sales['item_cnt_month'].values))
grouped_train['item_sales_prev_month'] = sales_for_item_prev_month
Any advice would be much appreciated!
Upvotes: 1
Views: 145
Reputation: 35676
Assuming date_block_num
are sequential.
Try calculating the sum and mean using groupby agg
then increment the date_block_num
by 1 to align it to the next group:
sum_means = df.groupby(['date_block_num', 'item_id']).agg(
item_sales_prev_month=('item_cnt_month', 'sum'),
mean_item_sales_prev_month=('item_cnt_month', 'mean')
).reset_index()
sum_means['date_block_num'] += 1
sum_means
:
date_block_num item_id item_sales_prev_month mean_item_sales_prev_month
0 1 26 12 6.0
1 1 32 7 3.5
2 2 26 2 2.0
3 2 32 1 1.0
Then merge
back to the original frame:
df = df.merge(sum_means, on=['date_block_num', 'item_id'], how='left')
df
:
date_block_num item_id shop_id item_cnt_month item_sales_prev_month mean_item_sales_prev_month
0 0 32 0 6 NaN NaN
1 0 32 5 1 NaN NaN
2 0 26 1 3 NaN NaN
3 0 26 18 9 NaN NaN
4 1 32 46 1 7.0 3.5
5 1 26 50 2 12.0 6.0
Complete Code:
import pandas as pd
df = pd.DataFrame({
'date_block_num': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1},
'item_id': {0: 32, 1: 32, 2: 26, 3: 26, 4: 32, 5: 26},
'shop_id': {0: 0, 1: 5, 2: 1, 3: 18, 4: 46, 5: 50},
'item_cnt_month': {0: 6, 1: 1, 2: 3, 3: 9, 4: 1, 5: 2}
})
sum_means = df.groupby(['date_block_num', 'item_id']).agg(
item_sales_prev_month=('item_cnt_month', 'sum'),
mean_item_sales_prev_month=('item_cnt_month', 'mean')
).reset_index()
sum_means['date_block_num'] += 1
df = df.merge(sum_means, on=['date_block_num', 'item_id'], how='left')
Upvotes: 1