Reputation: 31
I'm looking for way in which I can merge a table on multiple conditions, one of which is when a date is between two dates in the other table
Below is the two data sets DATA SET 1
Code 1 | Code 2 | Date | Number |
---|---|---|---|
001 | 192 | 02.02.22 | 10 |
002 | 192 | 05.03.22 | 12 |
002 | 192 | 09.05.22 | 8 |
003 | 193 | 14.06.22 | 14 |
003 | 193 | 16.08.22 | 18 |
DATA SET 2
Code 1 | Code 2 | Date Start | Date End |
---|---|---|---|
005 | 192 | 15.01.22 | 5.02.22 |
002 | 192 | 01.05.22 | 01.06.22 |
003 | 193 | 10.08.22 | 10.09.22 |
003 | 192 | 01.03.22 | 15.03.22 |
007 | 192 | 10.06.22 | 18.06.22 |
I basically need to end up with Data Set 2 but with the Number column attached - merged on Code 1, Code 2, and when the date in DS1 is between the two dates in DS 2.
In this example above, the outcome would look like this:
Code 1 | Code 2 | Date Start | Date End | Number |
---|---|---|---|---|
002 | 192 | 01.05.22 | 01.06.22 | 8 |
003 | 193 | 10.08.22 | 10.09.22 | 18 |
Thanks
Upvotes: 2
Views: 8110
Reputation: 121
I've been going round StackOverflow looking at all the similar questions (here, here, here, here and here), and haven't found a satisfying answer, here's my solution.
use sqlite or equivalent, in memory, which makes it very easy to write an SQL query using the SQL "between" syntax on join. It works, but if you have strange datatypes in your dataframe it fails because it cannot convert them to SQL. See this answer
use some difficult numpy syntax, using piecewise(). But it's quite memory heavy. See this answer
use pd.IntervalIndex, although I am not sure it works in every case, see the comments at the bottom of this answer.
Do the merge twice, basically.
df1 = pandas.DataFrame({'Code 1': ['001','002','003','005'],
'Code 2': ['192','193','193','195'],
'Date': pandas.date_range("2018-02-01", periods = 4, freq = "D")
})
df2 = pandas.DataFrame({'Code 1': ['002','003','005','006'],
'Code 2': ['193','193','195','195'],
'Date Start': pandas.to_datetime("2018-01-01"),
'Date End': pandas.to_datetime("2018-02-04")
})
#we want to use the original index for the join, so we add it as a column (or multiple columns if multiindex)
#also, we only need to keep the columns useful for the merge, to save memory
df1_ = df1[['Code 1', 'Code 2', 'Date']].reset_index()
df2_ = df2[['Code 1', 'Code 2', 'Date Start', 'Date End']].reset_index()
#we create a helper df, which at first merges df1 and df2 ignoring dates
df_h = df1_.merge(df2_, on = ['Code 1', 'Code 2'], how = 'inner', suffixes = ['_df1', '_df2'])
#then we filter the helper df for only the rows where the dates match
df_h = df_h[(df_h['Date'] > df_h['Date Start']) &
(df_h['Date'] < df_h['Date End'])]
#and we keep only the index columns of df1 and df2, which hold the information about which row in df1 merges on which row in df2
df_h = df_h[['index_df1','index_df2']].set_index('index_df1')
df2 = df2_.rename(columns = {'index': 'match_index'})
df_h = df1.join(df_h).rename(columns = {'index_df2': 'match_index'})
#finally our resulting DataFrame, which merges on "Code 1", "Code 2" and on the "match index"
df = df_h.merge(df2, on = ['Code 1', 'Code 2', 'match_index']).drop(['match_index'], axis = 1)
Upvotes: 1
Reputation: 120559
Try:
# Convert to datetime
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date Start'] = pd.to_datetime(df2['Date Start'], dayfirst=True)
df2['Date End'] = pd.to_datetime(df2['Date End'], dayfirst=True)
# Merge on Code 1 and Code 2 then keep only rows where Start Date <= Date <= End Date
out = df2.merge(df1, how='left', on=['Code 1', 'Code 2']) \
.query('Date.between(`Date Start`, `Date End`)')
Output:
Code 1 | Code 2 | Date Start | Date End | Date | Number |
---|---|---|---|---|---|
2 | 192 | 2022-05-01 00:00:00 | 2022-06-01 00:00:00 | 2022-05-09 00:00:00 | 8 |
3 | 193 | 2022-08-10 00:00:00 | 2022-09-10 00:00:00 | 2022-08-16 00:00:00 | 18 |
Upvotes: 1