Reputation: 55
I'd like to merge two dataframes, df1 & df2, based on whether rows of df2 fall within a 3-6 month date range after rows of df1. For example:
df1 (for each company I have quarterly data):
company DATADATE
0 012345 2005-06-30
1 012345 2005-09-30
2 012345 2005-12-31
3 012345 2006-03-31
4 123456 2005-01-31
5 123456 2005-03-31
6 123456 2005-06-30
7 123456 2005-09-30
df2 (for each company I have event dates that can happen on any day):
company EventDate
0 012345 2005-07-28 <-- won't get merged b/c not within date range
1 012345 2005-10-12
2 123456 2005-05-15
3 123456 2005-05-17
4 123456 2005-05-25
5 123456 2005-05-30
6 123456 2005-08-08
7 123456 2005-11-29
8 abcxyz 2005-12-31 <-- won't be merged because company not in df1
Ideal merged df -- rows with EventDates in df2 that fall 3-6 months (i.e. within a window one quarter wide) after DATADATEs in rows of df1 will be merged:
company DATADATE EventDate
0 012345 2005-06-30 2005-10-12
1 012345 2005-09-30 NaN <-- nan because no EventDates fell in this range
2 012345 2005-12-31 NaN
3 012345 2006-03-31 NaN
4 123456 2005-01-31 2005-05-15
5 123456 2005-01-31 2005-05-17
6 123456 2005-01-31 2005-05-25
7 123456 2005-01-31 2005-05-30
8 123456 2005-03-31 2005-08-08
9 123456 2005-06-30 2005-11-29
10 123456 2005-09-30 NaN
I am trying to apply this related topic [ Merge pandas DataFrames based on irregular time intervals ] by adding start_time and end_time columns to df1 (3 months and 6 months after DATADATE, respectively) and then using np.searchsorted(), but this case is a bit trickier because I'd like to merge on a company-by-company basis.
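Concretely, for a single company the kind of lookup I have in mind looks like the sketch below (a sketch only; using pd.offsets.MonthEnd for the 3- and 6-month bounds is my own guess at the right offset type):
import numpy as np
import pandas as pd

df1['start_time'] = df1.DATADATE + pd.offsets.MonthEnd(3)
df1['end_time'] = df1.DATADATE + pd.offsets.MonthEnd(6)
# one company's events, sorted so searchsorted's binary search is valid
events = df2[df2.company == '123456'].sort_values('EventDate').EventDate.values
quarters = df1[df1.company == '123456']
lo = np.searchsorted(events, quarters.start_time.values, side='right')
hi = np.searchsorted(events, quarters.end_time.values, side='left')
# events[lo[i]:hi[i]] are the EventDates matching quarter i; extending this
# cleanly across every company at once is the part I'm stuck on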
Upvotes: 3
Views: 1647
Reputation: 55
This is my solution, based on the algorithm Ami Tavory suggested below:
import bisect
import pandas as pd

# find the date offsets that define the date ranges
df1['start_time'] = df1.DATADATE + pd.offsets.MonthEnd(3)
df1['end_time'] = df1.DATADATE + pd.offsets.MonthEnd(6)
# find unique company names in both dfs
unique_companies_df1 = df1.company.unique()
unique_companies_df2 = df2.company.unique()
# sort df1 by company and DATADATE, so we can iterate in a sensible order
sorted_df1 = df1.sort_values(['company', 'DATADATE']).reset_index(drop=True)
# collect the merged pieces and concatenate them once at the end
merged_pieces = []
# iterate through each company in df1, find that company in sorted df2,
# then for each DATADATE quarter of df1, bisect df2 at the correct
# locations (i.e. start_time to end_time)
for cmpny in unique_companies_df1:
    if cmpny in unique_companies_df2:  # company is in both dfs, so take its rows
        selected_df2 = df2[df2.company == cmpny].sort_values('EventDate').reset_index(drop=True)
        selected_df1 = sorted_df1[sorted_df1.company == cmpny].reset_index(drop=True)
        for quarter in range(len(selected_df1.DATADATE)):  # each DATADATE quarter in df1
            # bisect_right to ensure that we do not include dates before our date range
            lo = bisect.bisect_right(selected_df2.EventDate, selected_df1.start_time[quarter])
            # bisect_left here to not include dates after our desired date range
            hi = bisect.bisect_left(selected_df2.EventDate, selected_df1.end_time[quarter])
            # iloc is end-exclusive, so this grabs exactly the EventDates in [lo, hi)
            df_right = selected_df2.iloc[lo:hi].copy()
            df_left = pd.DataFrame(selected_df1.loc[quarter]).transpose()
            if len(df_right) == 0:
                # no EventDates fell within range: create a row with cmpny in the
                # 'company' column and NaT in the EventDate column to merge
                df_right.loc[0, 'company'] = cmpny
            # merge the df1 company quarter with all df2 rows that fell within the date range
            merged_pieces.append(pd.merge(df_left, df_right, how='inner', on='company'))
df3 = pd.concat(merged_pieces, ignore_index=True)
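For reference, the example frames from the question can be built like this to test the loop (company codes kept as strings so the leading zeros survive):
import pandas as pd

df1 = pd.DataFrame({
    'company': ['012345'] * 4 + ['123456'] * 4,
    'DATADATE': pd.to_datetime(['2005-06-30', '2005-09-30', '2005-12-31', '2006-03-31',
                                '2005-01-31', '2005-03-31', '2005-06-30', '2005-09-30'])})
df2 = pd.DataFrame({
    'company': ['012345'] * 2 + ['123456'] * 6 + ['abcxyz'],
    'EventDate': pd.to_datetime(['2005-07-28', '2005-10-12', '2005-05-15', '2005-05-17',
                                 '2005-05-25', '2005-05-30', '2005-08-08', '2005-11-29',
                                 '2005-12-31'])})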
Upvotes: 2
Reputation: 76297
This is actually one of those rare questions where the algorithmic complexity can differ significantly between solutions. You might want to consider that over the niftiness of one-liner snippets.
Algorithmically:
1. Sort the larger of the two dataframes by date.
2. For each date in the smaller dataframe, use the bisect module to find the relevant rows in the larger dataframe.
For dataframes of lengths m and n, respectively (m < n), the searches should cost O(m log(n)), on top of a one-time O(n log(n)) sort of the larger dataframe.
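A sketch of the idea, with placeholder names (df_large, df_small, and the 'date', 'window_start', 'window_end' columns are all hypothetical):
import bisect

# sort the larger frame by its date column once
df_large = df_large.sort_values('date').reset_index(drop=True)
dates = list(df_large['date'])
# for each window in the smaller frame, binary-search the boundaries
for _, row in df_small.iterrows():
    lo = bisect.bisect_right(dates, row['window_start'])  # first date > start
    hi = bisect.bisect_left(dates, row['window_end'])     # first date >= end
    matches = df_large.iloc[lo:hi]  # the relevant rows for this window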
Upvotes: 2