I'm looking for way in which I can merge a table on multiple conditions, one of which is when a date is between two dates in the other table Below is the two data sets DATA SET 1 Code 1 Code 2 Date Number 001 192 02.02.22 10 002 192 05.03.22 12 002 192 09.05.22 8 003 193 14.06.22 14 003 193 16.08.22 18 DATA SET 2 Code 1 Code 2 Date Start Date End 005 192 15.01.22 5.02.22 002 192 01.05.22 01.06.22 003 193 10.08.22 10.09.22 003 192 01.03.22 15.03.22 007 192 10.06.22 18.06.22 I basically need to end up with Data Set 2 but with the Number column attached - merged on Code 1, Code 2, and when the date in DS1 is between the two dates in DS 2. In this example above, the outcome would look like this: Code 1 Code 2 Date Start Date End Number 002 192 01.05.22 01.06.22 8 003 193 10.08.22 10.09.22 18 Thanks

Reputation: 31

Merge when date is between two dates Pandas

I'm looking for way in which I can merge a table on multiple conditions, one of which is when a date is between two dates in the other table

Below is the two data sets DATA SET 1

Code 1	Code 2	Date	Number
001	192	02.02.22	10
002	192	05.03.22	12
002	192	09.05.22	8
003	193	14.06.22	14
003	193	16.08.22	18

DATA SET 2

Code 1	Code 2	Date Start	Date End
005	192	15.01.22	5.02.22
002	192	01.05.22	01.06.22
003	193	10.08.22	10.09.22
003	192	01.03.22	15.03.22
007	192	10.06.22	18.06.22

I basically need to end up with Data Set 2 but with the Number column attached - merged on Code 1, Code 2, and when the date in DS1 is between the two dates in DS 2.

In this example above, the outcome would look like this:

Code 1	Code 2	Date Start	Date End	Number
002	192	01.05.22	01.06.22	8
003	193	10.08.22	10.09.22	18

Thanks

Upvotes: 2

Answers (2)

ffi23

Reputation: 121

I've been going round StackOverflow looking at all the similar questions (here, here, here, here and here), and haven't found a satisfying answer, here's my solution.

First, the best answers I found:

use sqlite or equivalent, in memory, which makes it very easy to write an SQL query using the SQL "between" syntax on join. It works, but if you have strange datatypes in your dataframe it fails because it cannot convert them to SQL. See this answer
use some difficult numpy syntax, using piecewise(). But it's quite memory heavy. See this answer
use pd.IntervalIndex, although I am not sure it works in every case, see the comments at the bottom of this answer.

My answer:

Do the merge twice, basically.

Merge the first time on the non-date columns
Remove all the out-of-range date rows, keeping track of each row's index
Merge a second time on the correct indexes

We want to merge these two DataFrames, df1 and df2:

df1 = pandas.DataFrame({'Code 1': ['001','002','003','005'],
                        'Code 2': ['192','193','193','195'],
                        'Date': pandas.date_range("2018-02-01", periods = 4, freq = "D")
                       })

df2 = pandas.DataFrame({'Code 1': ['002','003','005','006'],
                        'Code 2': ['193','193','195','195'],
                        'Date Start': pandas.to_datetime("2018-01-01"),
                        'Date End': pandas.to_datetime("2018-02-04")
                       })

And here's my solution:

#we want to use the original index for the join, so we add it as a column (or multiple columns if multiindex)
#also, we only need to keep the columns useful for the merge, to save memory
df1_ = df1[['Code 1', 'Code 2', 'Date']].reset_index()
df2_ = df2[['Code 1', 'Code 2', 'Date Start', 'Date End']].reset_index()

#we create a helper df, which at first merges df1 and df2 ignoring dates
df_h = df1_.merge(df2_, on = ['Code 1', 'Code 2'], how = 'inner', suffixes = ['_df1', '_df2'])

#then we filter the helper df for only the rows where the dates match
df_h = df_h[(df_h['Date'] > df_h['Date Start']) & 
            (df_h['Date'] < df_h['Date End'])]

#and we keep only the index columns of df1 and df2, which hold the information about which row in df1 merges on which row in df2
df_h = df_h[['index_df1','index_df2']].set_index('index_df1')

df2 = df2_.rename(columns = {'index': 'match_index'})

df_h = df1.join(df_h).rename(columns = {'index_df2': 'match_index'})

#finally our resulting DataFrame, which merges on "Code 1", "Code 2" and on the "match index"
df = df_h.merge(df2, on = ['Code 1', 'Code 2', 'match_index']).drop(['match_index'], axis = 1)

Result DataFrame

Upvotes: 1

Corralien

Reputation: 120559

Try:

# Convert to datetime
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date Start'] = pd.to_datetime(df2['Date Start'], dayfirst=True)
df2['Date End'] = pd.to_datetime(df2['Date End'], dayfirst=True)

# Merge on Code 1 and Code 2 then keep only rows where Start Date <= Date <= End Date
out = df2.merge(df1, how='left', on=['Code 1', 'Code 2']) \
         .query('Date.between(`Date Start`, `Date End`)')

Output:

Code 1	Code 2	Date Start	Date End	Date	Number
2	192	2022-05-01 00:00:00	2022-06-01 00:00:00	2022-05-09 00:00:00	8
3	193	2022-08-10 00:00:00	2022-09-10 00:00:00	2022-08-16 00:00:00	18