Axel

Reputation: 2725

pandas - aggregate part of column to new value in new column

I have a large pandas dataframe df with warehouse data showing amounts of received items.

Imagine the relevant part of the structure as:

Date         SKU    received
2017-05-29   sku1   0
2017-05-30   sku1   0
2017-05-31   sku1   0
2017-06-01   sku1   0
2017-06-02   sku1   6
2017-06-03   sku1   2
2017-05-29   sku2   4
2017-05-30   sku2   4
2017-05-31   sku2   0
2017-06-01   sku2   0
2017-06-02   sku2   0
2017-06-03   sku2   24

From here I would like to reconstruct the ordering process. I know that the stock level is reviewed on Mondays and that, based on the stock level, new orders are placed. Orders arrive at the warehouse roughly a week later, sometimes split into multiple shipments.

I thought of creating an extra column for the weekday (df["Weekday"]) and one for the placed orders (df["Order"]). Based on the weekday, I would like to aggregate the data in the "received" column over the next 4 to 11 days, restricted to the relevant SKU.

The output could look like this:

Date         SKU    received    Weekday    Order
2017-05-29   sku1   0           0          8
2017-05-30   sku1   0           1          0
2017-05-31   sku1   0           2          0  
2017-06-01   sku1   0           3          0
2017-06-02   sku1   6           4          0
2017-06-03   sku1   2           5          0
2017-05-29   sku2   4           0          24
2017-05-30   sku2   4           1          0
2017-05-31   sku2   0           2          0
2017-06-01   sku2   0           3          0
2017-06-02   sku2   0           4          0
2017-06-03   sku2   24          5          0
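For reference, the input frame from the first table can be rebuilt like this (names and values exactly as above, with Date as a proper datetime column):

```python
import pandas as pd

dates = pd.to_datetime(["2017-05-29", "2017-05-30", "2017-05-31",
                        "2017-06-01", "2017-06-02", "2017-06-03"])
df = pd.DataFrame({
    "Date": list(dates) * 2,                     # same six days per SKU
    "SKU": ["sku1"] * 6 + ["sku2"] * 6,
    "received": [0, 0, 0, 0, 6, 2, 4, 4, 0, 0, 0, 24],
})
```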

Here is the code I tried:

import pandas as pd

# 0 is Monday, 1 is Tuesday, etc
df["Weekday"] = df["Date"].dt.dayofweek

# create new column for the orders
df["Order"] = 0

min_days = 4
max_days = min_days + 7

# for every Monday row, sum this SKU's receipts arriving 4 to 10 days later
for i in range(len(df)):
    if df.loc[i, "Weekday"] == 0:
        df.loc[i, "Order"] = df.loc[(df.Date >= df.loc[i, "Date"] + pd.to_timedelta(min_days, unit="D")) &
                                    (df.Date < df.loc[i, "Date"] + pd.to_timedelta(max_days, unit="D")) &
                                    (df.SKU == df.loc[i, "SKU"]), "received"].sum()

It seems to do the job, but slowly. Maybe someone can help me find a more Pythonic/pandas approach to save some computation time.

Thanks for your help.

Upvotes: 4

Views: 378

Answers (1)

FLab

Reputation: 7476

Here is a possible solution that uses pandas groupby and transform.

The first idea is that you can get the sum between two future days by taking the difference of two rolling sums. Also note the trick of reversing the order ([::-1]) twice, so that the rolling sum picks up days in the future instead of the past. (Since rolling counts rows, not dates, this assumes one row per SKU per calendar day, as in the sample data.)

def count_between(ts, min_days, max_days):
    # Reversed rolling sums look into the future; the difference of the two
    # windows sums the values at offsets min_days .. max_days - 1, i.e. the
    # half-open interval [t + min_days, t + max_days) used in the question.
    return ts[::-1].pipe(lambda y: y.rolling(max_days, 1).sum()
                                   - y.rolling(min_days, 1).sum())[::-1]
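As a quick sanity check on a toy series (the function is repeated so the snippet runs on its own; the series is sku1's "received" column from the question, with Monday at position 0):

```python
import pandas as pd

def count_between(ts, min_days, max_days):
    # difference of two future-looking rolling sums: for each day t this
    # totals the values at offsets min_days .. max_days - 1 after t
    return ts[::-1].pipe(lambda y: y.rolling(max_days, 1).sum()
                                   - y.rolling(min_days, 1).sum())[::-1]

s = pd.Series([0, 0, 0, 0, 6, 2])        # sku1's daily "received" values
print(count_between(s, 4, 11).tolist())  # [8.0, 2.0, 0.0, 0.0, 0.0, 0.0]
```

Only the Monday at position 0 has both receipts (6 + 2 = 8) inside its 4-to-11-day window; by Tuesday the window only reaches the final receipt of 2.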

This function gives a result for every day, so you restrict it to Mondays by setting all other entries to 0 (using .where).

After setting Date as index, you can do the following:

order = df.groupby('SKU')\
          .transform(lambda x: count_between(x, min_days, max_days)\
                               .where(lambda y: y.index.dayofweek==0, other = 0))
order.columns = ['Order']

This gives the expected result:

pd.concat([df, order], axis = 1)
Out[319]: 
             SKU  received  Order
Date                             
2017-05-29  sku1         0    8.0
2017-05-30  sku1         0    0.0
2017-05-31  sku1         0    0.0
2017-06-01  sku1         0    0.0
2017-06-02  sku1         6    0.0
2017-06-03  sku1         2    0.0
2017-05-29  sku2         4   24.0
2017-05-30  sku2         4    0.0
2017-05-31  sku2         0    0.0
2017-06-01  sku2         0    0.0
2017-06-02  sku2         0    0.0
2017-06-03  sku2        24    0.0

Upvotes: 1
