Reputation: 1023
I have the following dataframe of orders placed by different customers, at different times:
rng = list(pd.date_range('2019-02-24', periods=5, freq='T')) + list(pd.date_range('2019-03-13', periods=2, freq='T')) + list(pd.date_range('2019-02-27', periods=1, freq='T'))
customers = ["12987"]*5 + ["89563"]*2 + ["56733"]
articles = ["8473", "7631", "1264", "8473", "5641", "9813", "7631", "1132"]
order_history = pd.DataFrame({'Customer_no': customers, 'Date': rng, 'Article_no': articles})
order_history
Output:
Customer_no Date Article_no
0 12987 2019-02-24 00:00:00 8473
1 12987 2019-02-24 00:01:00 7631
2 12987 2019-02-24 00:02:00 1264
3 12987 2019-02-24 00:03:00 8473
4 12987 2019-02-24 00:04:00 5641
5 89563 2019-03-13 00:00:00 9813
6 89563 2019-03-13 00:01:00 7631
7 56733 2019-02-27 00:00:00 1132
For each row, I would like to get the list of articles that customer had previously bought.
Expected output:
Customer_no Date Article_no Previous_articles
0 12987 2019-02-24 00:00:00 8473 []
1 12987 2019-02-24 00:01:00 7631 [8473]
2 12987 2019-02-24 00:02:00 1264 [8473, 7631]
3 12987 2019-02-24 00:03:00 8473 [8473, 7631, 1264]
4 12987 2019-02-24 00:04:00 5641 [8473, 7631, 1264, 8473]
5 89563 2019-03-13 00:00:00 9813 []
6 89563 2019-03-13 00:01:00 7631 [9813]
7 56733 2019-02-27 00:00:00 1132 []
I realize that I could iterate over each row with a custom function, e.g. order_history.apply(lambda x: my_custom_function(x), axis=1), where my_custom_function(x) would, for each row, filter the entire order_history for the matching Customer_no and earlier dates. I also realize this solution would be highly inefficient, so I'm hoping someone has a better idea!
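For reference, the row-wise approach described in the question might look like the sketch below (the helper name previous_articles is made up here, and articles stay as strings). It produces the expected lists, but re-scans the whole frame once per row, which is what makes it slow on large data:

```python
import pandas as pd

# Sample data from the question
rng = (list(pd.date_range('2019-02-24', periods=5, freq='T'))
       + list(pd.date_range('2019-03-13', periods=2, freq='T'))
       + list(pd.date_range('2019-02-27', periods=1, freq='T')))
customers = ["12987"]*5 + ["89563"]*2 + ["56733"]
articles = ["8473", "7631", "1264", "8473", "5641", "9813", "7631", "1132"]
order_history = pd.DataFrame({'Customer_no': customers, 'Date': rng, 'Article_no': articles})

def previous_articles(row):
    # Filter the full history for the same customer and strictly earlier dates
    mask = ((order_history['Customer_no'] == row['Customer_no'])
            & (order_history['Date'] < row['Date']))
    return order_history.loc[mask, 'Article_no'].tolist()

order_history['Previous_articles'] = order_history.apply(previous_articles, axis=1)
```

This is O(n²) overall, since each of the n rows triggers a scan of all n rows.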
Upvotes: 1
Views: 46
Reputation: 672
An alternative solution may be to use a self-join:
df1 = df2 = order_history

# Self-join on Customer_no: pairs every order with every order by the same customer
df = df1.merge(df2, on='Customer_no')

# Keep only pairs where the joined order is not later than the current one
df_ = df[df['Date_x'] >= df['Date_y']]

# Collect the joined articles into a list per (customer, date); the current
# article is always the last element of that list, so drop it
df_final = df_.groupby(['Customer_no', 'Date_x'])[['Article_no_y']].agg(list).reset_index()
df_final['Previous_articles'] = df_final.Article_no_y.apply(lambda x: x[:-1])
df_final.rename(columns={'Date_x': 'Date'}, inplace=True)

# Join the previous-article lists back onto the original frame
df1.merge(df_final[['Customer_no', 'Date', 'Previous_articles']], on=['Customer_no', 'Date'])
Which returns:
Customer_no Date Article_no Previous_articles
0 12987 2019-02-24 00:00:00 8473 []
1 12987 2019-02-24 00:01:00 7631 [8473]
2 12987 2019-02-24 00:02:00 1264 [8473, 7631]
3 12987 2019-02-24 00:03:00 8473 [8473, 7631, 1264]
4 12987 2019-02-24 00:04:00 5641 [8473, 7631, 1264, 8473]
5 89563 2019-03-13 00:00:00 9813 []
6 89563 2019-03-13 00:01:00 7631 [9813]
7 56733 2019-02-27 00:00:00 1132 []
Upvotes: 1
Reputation: 1204
Input code:
import pandas as pd
import numpy as np
rng = list(pd.date_range('2019-02-24', periods=5, freq='T')) + list(pd.date_range('2019-03-13', periods=2, freq='T')) + list(pd.date_range('2019-02-27', periods=1, freq='T'))
customers = ["12987"]*5 + ["89563"]*2 + ["56733"]
articles = ["8473", "7631", "1264", "8473", "5641", "9813", "7631", "1132"]
order_history = pd.DataFrame({'Customer_no': customers, 'Date': rng, 'Article_no': articles})
order_history
Code to get Previous_articles:
# Wrap each article in a single-element list (empty list for missing values)
a = order_history['Article_no'].apply(lambda x: [] if pd.isnull(x) else [int(x)])

# Per customer, cumsum concatenates the lists; group_keys=False keeps the
# original index so the result aligns with order_history (needed in pandas 2.x,
# where apply otherwise prepends the group keys to the index)
order_history['Previous_articles'] = a.groupby(order_history['Customer_no'], group_keys=False).apply(lambda x: x.cumsum())

# Each cumulative list ends with the current article; drop it to keep only the previous ones
order_history['Previous_articles'] = order_history['Previous_articles'].apply(lambda x: x[:-1])
order_history
Output:
Customer_no Date Article_no Previous_articles
0 12987 2019-02-24 00:00:00 8473 []
1 12987 2019-02-24 00:01:00 7631 [8473]
2 12987 2019-02-24 00:02:00 1264 [8473, 7631]
3 12987 2019-02-24 00:03:00 8473 [8473, 7631, 1264]
4 12987 2019-02-24 00:04:00 5641 [8473, 7631, 1264, 8473]
5 89563 2019-03-13 00:00:00 9813 []
6 89563 2019-03-13 00:01:00 7631 [9813]
7 56733 2019-02-27 00:00:00 1132 []
Upvotes: 1