TheLastCoder
TheLastCoder

Reputation: 610

Check if a list of dates falls between a list of date ranges

I have two Dataframes extracted from a large hotel database:

  1. A customer shopping history dataframe (df_hist)
    customer_id   item   date     
     1234         milk   2012-04-20       
     1234         sugar  2012-05-01      
     5678         salt   2017-07-15    
     5678         water  2017-08-10    
  1. A customer visit history dataframe (df_visit)
    customer_id   start          end         visit
     1234         2012-04-06     2012-04-25  1    
     5678         2017-07-10     2017-07-20  5
     5678         2017-08-05     2017-08-11  6    
   

I'm trying to find out the visit number for each item in the purchase history

  1. Result(df_result):
    customer_id   item   date         visit
     1234         milk   2012-04-20   1  
     1234         sugar  2012-05-01   null 
     5678         salt   2017-07-15   5 
     5678         water  2017-08-10   6 

I tried using multiple for loops but it's not scalable given that df_visit has close to 6 million rows corresponding to around 15,000 unique customers. What would be a more efficient approach to solve this issue?

Upvotes: 3

Views: 694

Answers (3)

Shubham Sharma
Shubham Sharma

Reputation: 71707

As the dataset is large you could not be able to perform merge operation directly as it will generate large number of unnecessary rows. Here is one way of solving the problem by using IntervalIndex to create a mapping series for each group identified by the unique customer_id in df_visit:

def create_map():
    mappings = []
    for _, g in df_visit.groupby('customer_id', sort=False):
        i = pd.IntervalIndex.from_arrays(g['start'], g['end'])
        mappings.append(g.set_index(['customer_id', i])['visit'])
    return pd.concat(mappings)

out = df_hist.set_index(['customer_id', 'date'])
out = out.assign(visit=out.index.map(create_map())).reset_index()

Alternate approach which can be used when df_visit is already sorted on customer_id:

def create_intervals():
    intervals = []
    for _, g in df_visit.groupby('customer_id', sort=False):
        intervals.append(pd.IntervalIndex.from_arrays(g['start'], g['end']))
    return intervals

i = create_intervals()
mapping = df_visit.set_index(['customer_id', np.hstack(i)])['visit']
df_hist['visit'] = df_hist.set_index(['customer_id', 'date']).index.map(mapping)

   customer_id       date   item  visit
0         1234 2012-04-20   milk    1.0
1         1234 2012-05-01  sugar    NaN
2         5678 2017-07-15   salt    5.0
3         5678 2017-08-10  water    6.0

Upvotes: 1

SultanOrazbayev
SultanOrazbayev

Reputation: 16581

Here's one way:

import io
d1 = io.StringIO("""
    customer_id   item   date     
     1234         milk   2012-04-20       
     1234         sugar  2012-05-01      
     5678         salt   2017-07-15    
     5678         water  2017-08-10    
""")

d2 = io.StringIO("""
  customer_id   start          end         visit
     1234         2012-04-06     2012-04-25  1    
     5678         2017-07-10     2017-07-20  5
     5678         2017-08-05     2017-08-11  6
""")

import pandas as pd

df1 = pd.read_csv(d1, sep='\s+', parse_dates=['date'])
df2 = pd.read_csv(d2, sep='\s+', parse_dates=['start', 'end'])

merged = pd.merge_asof(df1, df2, left_on=['date'], right_on=['start'], by='customer_id', direction='backward')

mask_dates = (merged['end'] >= merged['date']) & (merged['date']>=merged['start'])

merged['visit'] = merged.loc[mask_dates, 'visit']

merged

Upvotes: 1

Random thoughts
Random thoughts

Reputation: 1

You could pre-process your input data to reduce set of items walked in each iteration. Since you know the customer ID in both input history data sets, that would be the natural choice to organize your data around.

  • Step one: walk customer visit history and create history per customer ID.
  • Step two: for each shopping entry, lookup visit history for THE customer and make match in this much smaller subset.

Numerous implementation optimizations are possible in both steps. Say, how you keep the processed list after step one (copy out records or just keep index of an entry in big table etc). For step two, if further speedup is necessary you can organize visits into tries etc.

But even in the simplest case of for loops, just by separating history of each customer you'll get 10000 speedup (at the expense of initial history processing).

Upvotes: 0

Related Questions