Reputation: 43
I am trying to "merge" two Dataframe based on overlapping intervals as below:
Dataset 1
start_date | end_date | field1 |
---|---|---|
2020-01-01 | 2020-06-30 | A |
2020-07-01 | 2020-12-31 | B |
Dataset 2
start_date | end_date | field2 |
---|---|---|
2020-01-01 | 2020-04-30 | D |
2020-05-01 | 2020-08-31 | E |
2020-09-01 | 2020-12-31 | F |
Combined Dataset
start_date | end_date | field1 | field2 |
---|---|---|---|
2020-01-01 | 2020-04-30 | A | D |
2020-05-01 | 2020-06-30 | A | E |
2020-07-01 | 2020-08-31 | B | E |
2020-09-01 | 2020-12-31 | B | F |
The code is python for the example would be the following
import pandas as pd
df1 = pd.DataFrame([['2020-01-01','2020-06-30','A'],
['2020-07-01','2020-12-31','B']],
columns = ['start_date', 'end_date', 'field1'])
df2 = pd.DataFrame([['2020-01-01','2020-04-30','D'],
['2020-05-01','2020-08-31','E'],
['2020-09-01','2020-12-31','F']],
columns = ['start_date', 'end_date', 'field2'])
expected_output = pd.DataFrame([['2020-01-01','2020-04-30','A','D'],
['2020-05-01','2020-06-30','A','E'],
['2020-07-01','2020-08-31','B','E'],
['2020-09-01','2020-12-31','B','F']],
columns = ['start_date', 'end_date','field1', 'field2'])
I have really tried to think about ways to do it but I must say my page is blank ... Thanks a lot in advance for any recommendation !
Upvotes: 4
Views: 401
Reputation: 23217
You can do it in these steps:
df1
, define date range for each row by pd.date_range()
corresponding to the period from start_date
to end_date
for each rowdf2
define date range for each row in similar waydate_range
for each of df1
and df2
into multiple rows with each date in a row.df1
and df2
on the date_range
columns in each dataframe. Now, we can already get the intersections of the common dates in each original dataframe for further processing.field1
and field2
on the intersection dates, we can get the new start_date
of the common date range by taking the first entry in the group.end_date
of the common date range by taking the last entry in the group.df1a = (df1.assign(date_range=df1.apply(lambda x: pd.date_range(start=x['start_date'], end=x['end_date']), axis=1))
.explode('date_range'))
df2a = (df2.assign(date_range=df2.apply(lambda x: pd.date_range(start=x['start_date'], end=x['end_date']), axis=1))
.explode('date_range'))
df3 = df1a.merge(df2a, on='date_range')
df3['start_date'] = df3.groupby(['field1', 'field2'])['date_range'].transform('first')
df3['end_date'] = df3.groupby(['field1', 'field2'])['date_range'].transform('last')
df4 = df3.groupby(['field1', 'field2']).agg('first').reset_index()[['start_date', 'end_date', 'field1', 'field2']]
print(df4)
start_date end_date field1 field2
0 2020-01-01 2020-04-30 A D
1 2020-05-01 2020-06-30 A E
2 2020-07-01 2020-08-31 B E
3 2020-09-01 2020-12-31 B F
Upvotes: 3