Reputation: 1023
I have the following dataframe of orders placed by different customers, at different times:
rng = list(pd.date_range('2019-02-24', periods=5, freq='T')) + list(pd.date_range('2019-03-13', periods=2, freq='T')) + list(pd.date_range('2019-02-27', periods=1, freq='T'))
customers = ["12987"]*5 + ["89563"]*2 + ["56733"]
articles = ["8473", "7631", "1264", "8473", "5641", "9813", "7631", "1132"]
order_history = pd.DataFrame({'Customer_no': customers, 'Date': rng, 'Article_no': articles})
order_history
Output:
Customer_no Date Article_no
0 12987 2019-02-24 00:00:00 8473
1 12987 2019-02-24 00:01:00 7631
2 12987 2019-02-24 00:02:00 1264
3 12987 2019-02-24 00:03:00 8473
4 12987 2019-02-24 00:04:00 5641
5 89563 2019-03-13 00:00:00 9813
6 89563 2019-03-13 00:01:00 7631
7 56733 2019-02-27 00:00:00 1132
For each row, I would like to get the list of articles that customer had previously bought.
Expected output:
Customer_no Date Article_no Previous_articles
0 12987 2019-02-24 00:00:00 8473 []
1 12987 2019-02-24 00:01:00 7631 [8473]
2 12987 2019-02-24 00:02:00 1264 [8473, 7631]
3 12987 2019-02-24 00:03:00 8473 [8473, 7631, 1264]
4 12987 2019-02-24 00:04:00 5641 [8473, 7631, 1264, 8473]
5 89563 2019-03-13 00:00:00 9813 []
6 89563 2019-03-13 00:01:00 7631 [9813]
7 56733 2019-02-27 00:00:00 1132 []
I realize that I could iterate over each row with a custom function, e.g. order_history.apply(lambda x: my_custom_function(x), axis=1), where my_custom_function(x) would, for each row, filter the entire order_history for the matching Customer_no and earlier dates. I also realize this solution would be highly inefficient, so I'm hoping someone has a better idea!
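For reference, the row-wise approach described in the question might look like the sketch below (the helper name previous_articles is made up here, and articles stay as strings). It produces the expected lists, but re-scans the whole frame once per row, which is what makes it slow on large data:

```python
import pandas as pd

# Sample data from the question
rng = (list(pd.date_range('2019-02-24', periods=5, freq='T'))
       + list(pd.date_range('2019-03-13', periods=2, freq='T'))
       + list(pd.date_range('2019-02-27', periods=1, freq='T')))
customers = ["12987"]*5 + ["89563"]*2 + ["56733"]
articles = ["8473", "7631", "1264", "8473", "5641", "9813", "7631", "1132"]
order_history = pd.DataFrame({'Customer_no': customers, 'Date': rng, 'Article_no': articles})

def previous_articles(row):
    # Filter the full history for the same customer and strictly earlier dates
    mask = ((order_history['Customer_no'] == row['Customer_no'])
            & (order_history['Date'] < row['Date']))
    return order_history.loc[mask, 'Article_no'].tolist()

order_history['Previous_articles'] = order_history.apply(previous_articles, axis=1)
```

This is O(n²) overall, since each of the n rows triggers a scan of all n rows.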
Upvotes: 1
Views: 46
Reputation: 672
An alternative solution may be to use a self-join:
df1 = df2 = order_history

# Self-join on Customer_no: pairs every order with every order by the same customer
df = df1.merge(df2, on='Customer_no')

# Keep only pairs where the joined order is not later than the current one
df_ = df[df['Date_x'] >= df['Date_y']]

# Collect the joined articles into a list per (customer, date); the current
# article is always the last element of that list, so drop it
df_final = df_.groupby(['Customer_no', 'Date_x'])[['Article_no_y']].agg(list).reset_index()
df_final['Previous_articles'] = df_final.Article_no_y.apply(lambda x: x[:-1])
df_final.rename(columns={'Date_x': 'Date'}, inplace=True)

# Join the previous-article lists back onto the original frame
df1.merge(df_final[['Customer_no', 'Date', 'Previous_articles']], on=['Customer_no', 'Date'])
Which returns:
Customer_no Date Article_no Previous_articles
0 12987 2019-02-24 00:00:00 8473 []
1 12987 2019-02-24 00:01:00 7631 [8473]
2 12987 2019-02-24 00:02:00 1264 [8473, 7631]
3 12987 2019-02-24 00:03:00 8473 [8473, 7631, 1264]
4 12987 2019-02-24 00:04:00 5641 [8473, 7631, 1264, 8473]
5 89563 2019-03-13 00:00:00 9813 []
6 89563 2019-03-13 00:01:00 7631 [9813]
7 56733 2019-02-27 00:00:00 1132 []
Upvotes: 1
Reputation: 1204
Input code:
import pandas as pd
import numpy as np
rng = list(pd.date_range('2019-02-24', periods=5, freq='T')) + list(pd.date_range('2019-03-13', periods=2, freq='T')) + list(pd.date_range('2019-02-27', periods=1, freq='T'))
customers = ["12987"]*5 + ["89563"]*2 + ["56733"]
articles = ["8473", "7631", "1264", "8473", "5641", "9813", "7631", "1132"]
order_history = pd.DataFrame({'Customer_no': customers, 'Date': rng, 'Article_no': articles})
order_history
Code to get Previous_articles:
# Wrap each article in a single-element list (empty list for missing values)
a = order_history['Article_no'].apply(lambda x: [] if pd.isnull(x) else [int(x)])

# Per customer, cumsum concatenates the lists; group_keys=False keeps the
# original index so the result aligns with order_history (needed in pandas 2.x,
# where apply otherwise prepends the group keys to the index)
order_history['Previous_articles'] = a.groupby(order_history['Customer_no'], group_keys=False).apply(lambda x: x.cumsum())

# Each cumulative list ends with the current article; drop it to keep only the previous ones
order_history['Previous_articles'] = order_history['Previous_articles'].apply(lambda x: x[:-1])
order_history
Output:
Customer_no Date Article_no Previous_articles
0 12987 2019-02-24 00:00:00 8473 []
1 12987 2019-02-24 00:01:00 7631 [8473]
2 12987 2019-02-24 00:02:00 1264 [8473, 7631]
3 12987 2019-02-24 00:03:00 8473 [8473, 7631, 1264]
4 12987 2019-02-24 00:04:00 5641 [8473, 7631, 1264, 8473]
5 89563 2019-03-13 00:00:00 9813 []
6 89563 2019-03-13 00:01:00 7631 [9813]
7 56733 2019-02-27 00:00:00 1132 []
Upvotes: 1