Reputation: 610
I have two DataFrames extracted from a large hotel database. The first, df_hist, is the purchase history:

customer_id  item   date
1234         milk   2012-04-20
1234         sugar  2012-05-01
5678         salt   2017-07-15
5678         water  2017-08-10
The second, df_visit, holds the visit windows:

customer_id  start       end         visit
1234         2012-04-06  2012-04-25  1
5678         2017-07-10  2017-07-20  5
5678         2017-08-05  2017-08-11  6
I'm trying to find out the visit number for each item in the purchase history:

customer_id  item   date        visit
1234         milk   2012-04-20  1
1234         sugar  2012-05-01  null
5678         salt   2017-07-15  5
5678         water  2017-08-10  6
I tried using multiple for loops, but that doesn't scale: df_visit has close to 6 million rows corresponding to around 15,000 unique customers. What would be a more efficient approach to this problem?
Upvotes: 3
Views: 694
Reputation: 71707
As the dataset is large, you can't perform the merge operation directly: joining on customer_id alone would pair every purchase with every visit of that customer, generating a large number of unnecessary rows before filtering. Here is one way of solving the problem using IntervalIndex to create a mapping series for each group identified by a unique customer_id in df_visit:
import pandas as pd

def create_map():
    mappings = []
    for _, g in df_visit.groupby('customer_id', sort=False):
        # Interval index over this customer's visit windows
        i = pd.IntervalIndex.from_arrays(g['start'], g['end'])
        mappings.append(g.set_index(['customer_id', i])['visit'])
    return pd.concat(mappings)

# Look up each (customer_id, date) pair in the interval mapping
out = df_hist.set_index(['customer_id', 'date'])
out = out.assign(visit=out.index.map(create_map())).reset_index()
An alternate approach, which can be used when df_visit is already sorted on customer_id:
import numpy as np

def create_intervals():
    intervals = []
    for _, g in df_visit.groupby('customer_id', sort=False):
        intervals.append(pd.IntervalIndex.from_arrays(g['start'], g['end']))
    return intervals

i = create_intervals()
# Stitch the per-customer intervals back together; this lines up with
# df_visit's rows only because the frame is sorted on customer_id.
mapping = df_visit.set_index(['customer_id', np.hstack(i)])['visit']
df_hist['visit'] = df_hist.set_index(['customer_id', 'date']).index.map(mapping)
   customer_id       date   item  visit
0         1234 2012-04-20   milk    1.0
1         1234 2012-05-01  sugar    NaN
2         5678 2017-07-15   salt    5.0
3         5678 2017-08-10  water    6.0
Upvotes: 1
Reputation: 16581
Here's one way using pd.merge_asof:
import io
import pandas as pd

d1 = io.StringIO("""
customer_id item date
1234 milk 2012-04-20
1234 sugar 2012-05-01
5678 salt 2017-07-15
5678 water 2017-08-10
""")
d2 = io.StringIO("""
customer_id start end visit
1234 2012-04-06 2012-04-25 1
5678 2017-07-10 2017-07-20 5
5678 2017-08-05 2017-08-11 6
""")
df1 = pd.read_csv(d1, sep=r'\s+', parse_dates=['date'])
df2 = pd.read_csv(d2, sep=r'\s+', parse_dates=['start', 'end'])
# Match each purchase to the most recent visit start on or before its date.
# merge_asof requires both frames to be sorted on the asof keys (date/start).
merged = pd.merge_asof(df1, df2, left_on='date', right_on='start',
                       by='customer_id', direction='backward')
# Keep the visit only where the purchase date actually falls inside the window
mask_dates = (merged['end'] >= merged['date']) & (merged['date'] >= merged['start'])
merged['visit'] = merged.loc[mask_dates, 'visit']
merged
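On the sample data, merged should then look roughly like this (the start and end columns carried over from df2 can be dropped if they are not needed):

   customer_id   item       date      start        end  visit
0         1234   milk 2012-04-20 2012-04-06 2012-04-25    1.0
1         1234  sugar 2012-05-01 2012-04-06 2012-04-25    NaN
2         5678   salt 2017-07-15 2017-07-10 2017-07-20    5.0
3         5678  water 2017-08-10 2017-08-05 2017-08-11    6.0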
Upvotes: 1
Reputation: 1
You could pre-process your input data to reduce the set of rows walked in each iteration. Since the customer ID is present in both input data sets, it is the natural key to organize your data around: first partition the visits by customer, then match each purchase only against that customer's visits.
Numerous implementation optimizations are possible in both steps: for instance, how you keep the partitioned data after step one (copy the records out, or just keep indices into the big table), and, if step two needs further speedup, organizing each customer's visits into tries or similar search structures.
But even in the simplest case of plain for loops, just separating the history of each customer should give a speedup on the order of the number of customers (roughly 10,000x here), at the expense of the initial history processing. A sketch of that simplest case follows.
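A minimal sketch of the idea, assuming the frames are named df_hist and df_visit as in the question (visits_by_customer and find_visit are illustrative names, not part of the original answer):

# Step one: partition the visits by customer,
# paying the pre-processing cost once.
visits_by_customer = {
    cid: list(g[['start', 'end', 'visit']].itertuples(index=False, name=None))
    for cid, g in df_visit.groupby('customer_id', sort=False)
}

# Step two: each purchase scans only its own customer's visits
# instead of the full 6-million-row table.
def find_visit(row):
    for start, end, visit in visits_by_customer.get(row['customer_id'], ()):
        if start <= row['date'] <= end:
            return visit
    return None

df_hist['visit'] = df_hist.apply(find_visit, axis=1)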
Upvotes: 0