Oli Dewes
Oli Dewes

Reputation: 31

Merge when date is between two dates Pandas

I'm looking for way in which I can merge a table on multiple conditions, one of which is when a date is between two dates in the other table

Below is the two data sets DATA SET 1

Code 1 Code 2 Date Number
001 192 02.02.22 10
002 192 05.03.22 12
002 192 09.05.22 8
003 193 14.06.22 14
003 193 16.08.22 18

DATA SET 2

Code 1 Code 2 Date Start Date End
005 192 15.01.22 5.02.22
002 192 01.05.22 01.06.22
003 193 10.08.22 10.09.22
003 192 01.03.22 15.03.22
007 192 10.06.22 18.06.22

I basically need to end up with Data Set 2 but with the Number column attached - merged on Code 1, Code 2, and when the date in DS1 is between the two dates in DS 2.

In this example above, the outcome would look like this:

Code 1 Code 2 Date Start Date End Number
002 192 01.05.22 01.06.22 8
003 193 10.08.22 10.09.22 18

Thanks

Upvotes: 2

Views: 8110

Answers (2)

ffi23
ffi23

Reputation: 121

I've been going round StackOverflow looking at all the similar questions (here, here, here, here and here), and haven't found a satisfying answer, here's my solution.

First, the best answers I found:

  • use sqlite or equivalent, in memory, which makes it very easy to write an SQL query using the SQL "between" syntax on join. It works, but if you have strange datatypes in your dataframe it fails because it cannot convert them to SQL. See this answer

  • use some difficult numpy syntax, using piecewise(). But it's quite memory heavy. See this answer

  • use pd.IntervalIndex, although I am not sure it works in every case, see the comments at the bottom of this answer.

My answer:

Do the merge twice, basically.

  1. Merge the first time on the non-date columns
  2. Remove all the out-of-range date rows, keeping track of each row's index
  3. Merge a second time on the correct indexes

We want to merge these two DataFrames, df1 and df2:

df1 = pandas.DataFrame({'Code 1': ['001','002','003','005'],
                        'Code 2': ['192','193','193','195'],
                        'Date': pandas.date_range("2018-02-01", periods = 4, freq = "D")
                       })

df2 = pandas.DataFrame({'Code 1': ['002','003','005','006'],
                        'Code 2': ['193','193','195','195'],
                        'Date Start': pandas.to_datetime("2018-01-01"),
                        'Date End': pandas.to_datetime("2018-02-04")
                       })

df1

df2


And here's my solution:

#we want to use the original index for the join, so we add it as a column (or multiple columns if multiindex)
#also, we only need to keep the columns useful for the merge, to save memory
df1_ = df1[['Code 1', 'Code 2', 'Date']].reset_index()
df2_ = df2[['Code 1', 'Code 2', 'Date Start', 'Date End']].reset_index()

#we create a helper df, which at first merges df1 and df2 ignoring dates
df_h = df1_.merge(df2_, on = ['Code 1', 'Code 2'], how = 'inner', suffixes = ['_df1', '_df2'])

#then we filter the helper df for only the rows where the dates match
df_h = df_h[(df_h['Date'] > df_h['Date Start']) & 
            (df_h['Date'] < df_h['Date End'])]

#and we keep only the index columns of df1 and df2, which hold the information about which row in df1 merges on which row in df2
df_h = df_h[['index_df1','index_df2']].set_index('index_df1')

df2 = df2_.rename(columns = {'index': 'match_index'})

df_h = df1.join(df_h).rename(columns = {'index_df2': 'match_index'})

#finally our resulting DataFrame, which merges on "Code 1", "Code 2" and on the "match index"
df = df_h.merge(df2, on = ['Code 1', 'Code 2', 'match_index']).drop(['match_index'], axis = 1)


Result DataFrame

result df

Upvotes: 1

Corralien
Corralien

Reputation: 120559

Try:

# Convert to datetime
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date Start'] = pd.to_datetime(df2['Date Start'], dayfirst=True)
df2['Date End'] = pd.to_datetime(df2['Date End'], dayfirst=True)

# Merge on Code 1 and Code 2 then keep only rows where Start Date <= Date <= End Date
out = df2.merge(df1, how='left', on=['Code 1', 'Code 2']) \
         .query('Date.between(`Date Start`, `Date End`)')

Output:

Code 1 Code 2 Date Start Date End Date Number
2 192 2022-05-01 00:00:00 2022-06-01 00:00:00 2022-05-09 00:00:00 8
3 193 2022-08-10 00:00:00 2022-09-10 00:00:00 2022-08-16 00:00:00 18

Upvotes: 1

Related Questions