Cambyst
Cambyst

Reputation: 43

Joining two datasets based on overlapping time intervals

I am trying to "merge" two Dataframe based on overlapping intervals as below:

Dataset 1

start_date end_date field1
2020-01-01 2020-06-30 A
2020-07-01 2020-12-31 B

Dataset 2

start_date end_date field2
2020-01-01 2020-04-30 D
2020-05-01 2020-08-31 E
2020-09-01 2020-12-31 F

Combined Dataset

start_date end_date field1 field2
2020-01-01 2020-04-30 A D
2020-05-01 2020-06-30 A E
2020-07-01 2020-08-31 B E
2020-09-01 2020-12-31 B F

The code is python for the example would be the following

import pandas as pd
df1 = pd.DataFrame([['2020-01-01','2020-06-30','A'],
                    ['2020-07-01','2020-12-31','B']], 
                     columns = ['start_date', 'end_date', 'field1'])

df2 = pd.DataFrame([['2020-01-01','2020-04-30','D'],
                    ['2020-05-01','2020-08-31','E'],
                    ['2020-09-01','2020-12-31','F']], 
                     columns = ['start_date', 'end_date', 'field2'])


expected_output = pd.DataFrame([['2020-01-01','2020-04-30','A','D'],
                                ['2020-05-01','2020-06-30','A','E'],
                                ['2020-07-01','2020-08-31','B','E'],
                                ['2020-09-01','2020-12-31','B','F']], 
                                columns = ['start_date', 'end_date','field1', 'field2'])

I have really tried to think about ways to do it but I must say my page is blank ... Thanks a lot in advance for any recommendation !

Upvotes: 4

Views: 401

Answers (1)

SeaBean
SeaBean

Reputation: 23217

You can do it in these steps:

  1. For df1, define date range for each row by pd.date_range() corresponding to the period from start_date to end_date for each row
  2. Similarly, for df2 define date range for each row in similar way
  3. Explode the list of dates in newly created date_range for each of df1 and df2 into multiple rows with each date in a row.
  4. Perform an inner merge on df1 and df2 on the date_range columns in each dataframe. Now, we can already get the intersections of the common dates in each original dataframe for further processing.
  5. Group by field1 and field2 on the intersection dates, we can get the new start_date of the common date range by taking the first entry in the group.
  6. Similarly, we can get the new end_date of the common date range by taking the last entry in the group.
  7. Finally, we aggregate the entries and take only the first row in each group which already has all the required information we want.

df1a = (df1.assign(date_range=df1.apply(lambda x: pd.date_range(start=x['start_date'], end=x['end_date']), axis=1))
           .explode('date_range'))
df2a = (df2.assign(date_range=df2.apply(lambda x: pd.date_range(start=x['start_date'], end=x['end_date']), axis=1))
           .explode('date_range'))
df3 = df1a.merge(df2a, on='date_range')
df3['start_date'] = df3.groupby(['field1', 'field2'])['date_range'].transform('first')
df3['end_date'] = df3.groupby(['field1', 'field2'])['date_range'].transform('last')
df4 = df3.groupby(['field1', 'field2']).agg('first').reset_index()[['start_date', 'end_date', 'field1', 'field2']]



print(df4)

  start_date   end_date field1 field2
0 2020-01-01 2020-04-30      A      D
1 2020-05-01 2020-06-30      A      E
2 2020-07-01 2020-08-31      B      E
3 2020-09-01 2020-12-31      B      F

Upvotes: 3

Related Questions