Reputation: 137
I have two simple dataframes. I would like to merge them so that each (first_date, second_date) interval gets the largest special_date satisfying first_date <= special_date <= second_date.
   ID | special_date |
0  11 |   2019-04-06 |
1  11 |   2019-04-09 |
2  11 |   2019-06-03 |
3  11 |   2019-03-11 |
   ID | first_date | second_date |
0  11 | 2019-04-03 |  2019-04-09 |
1  11 | 2019-05-02 |  2019-05-14 |
2  11 | 2019-05-20 |  2019-06-05 |
3  11 | 2019-03-03 |  2019-03-07 |
Desired output:
   ID | first_date | special_date | second_date |
0  11 | 2019-04-03 |   2019-04-09 |  2019-04-09 |
1  11 | 2019-05-02 |          NaN |  2019-05-14 |
2  11 | 2019-05-20 |   2019-06-03 |  2019-06-05 |
3  11 | 2019-03-03 |          NaN |  2019-03-07 |
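For reference, the two input frames can be reconstructed like this (a sketch; it assumes the date columns should be datetime64, which the answers below also rely on):

```python
import pandas as pd

# Values copied verbatim from the tables in the question.
df1 = pd.DataFrame({
    "ID": [11, 11, 11, 11],
    "special_date": pd.to_datetime(
        ["2019-04-06", "2019-04-09", "2019-06-03", "2019-03-11"]),
})
df2 = pd.DataFrame({
    "ID": [11, 11, 11, 11],
    "first_date": pd.to_datetime(
        ["2019-04-03", "2019-05-02", "2019-05-20", "2019-03-03"]),
    "second_date": pd.to_datetime(
        ["2019-04-09", "2019-05-14", "2019-06-05", "2019-03-07"]),
})
```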
Upvotes: 2
Views: 182
Reputation: 262634
Here is a merge-first, clean-up-after approach: use merge_asof
to match each "special_date" to the latest "first_date" at or before it, then blank out matches where it exceeds "second_date".
prerequisite:
df1['special_date'] = pd.to_datetime(df1['special_date'])
df2['first_date'] = pd.to_datetime(df2['first_date'])
df2['second_date'] = pd.to_datetime(df2['second_date'])
processing:
df3 = pd.merge_asof(
    df1.sort_values(by='special_date'),
    df2.sort_values(by='first_date'),
    left_on='special_date',
    right_on='first_date',
    suffixes=['', '_drop'],
).drop(columns='ID_drop')
# The interval is inclusive on both ends, so use <=
# (with < the 2019-04-09 match would be wrongly dropped):
df3['special_date'] = df3['special_date'].where(df3['special_date'] <= df3['second_date'])
output:
   ID special_date first_date second_date
0  11          NaT 2019-03-03  2019-03-07
1  11   2019-04-06 2019-04-03  2019-04-09
2  11   2019-04-09 2019-04-03  2019-04-09
3  11   2019-06-03 2019-05-20  2019-06-05
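As a side note, merge_asof also accepts by='ID', which keeps each asof match within one ID group and makes the suffixes/drop step unnecessary if the real data contains several IDs. A self-contained sketch (using the inclusive upper bound the question asks for):

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [11] * 4,
    "special_date": pd.to_datetime(
        ["2019-04-06", "2019-04-09", "2019-06-03", "2019-03-11"]),
})
df2 = pd.DataFrame({
    "ID": [11] * 4,
    "first_date": pd.to_datetime(
        ["2019-04-03", "2019-05-02", "2019-05-20", "2019-03-03"]),
    "second_date": pd.to_datetime(
        ["2019-04-09", "2019-05-14", "2019-06-05", "2019-03-07"]),
})

# by="ID" restricts each backward asof match to rows with the same ID,
# so no duplicate ID column (and hence no suffixes) is produced.
df3 = pd.merge_asof(
    df1.sort_values("special_date"),
    df2.sort_values("first_date"),
    left_on="special_date",
    right_on="first_date",
    by="ID",
)
# Blank out matches that fall outside the (inclusive) interval.
df3["special_date"] = df3["special_date"].where(
    df3["special_date"] <= df3["second_date"]
)
```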
Upvotes: 1
Reputation: 23237
You can use:
Convert dates if not already in datetime format
df1['special_date'] = pd.to_datetime(df1['special_date'])
df2['first_date'] = pd.to_datetime(df2['first_date'])
df2['second_date'] = pd.to_datetime(df2['second_date'])
Then merge, filter, group, and select the largest possible date with .merge()
+ .query()
+ .groupby()
+ .max()
, as follows:
df_out = (
    df1.merge(df2, on='ID', how='right')
       .query('(special_date >= first_date) & (special_date <= second_date)')
       .groupby(['ID', 'first_date', 'second_date'], as_index=False)['special_date'].max()
       .merge(df2, on=['ID', 'first_date', 'second_date'], how='right')
)
Result:
print(df_out)
   ID first_date second_date special_date
0  11 2019-04-03  2019-04-09   2019-04-09
1  11 2019-05-02  2019-05-14          NaT
2  11 2019-05-20  2019-06-05   2019-06-03
3  11 2019-03-03  2019-03-07          NaT
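As a self-contained check (frames rebuilt from the question's data), the pipeline reproduces the desired output once the columns are reordered, which is a cosmetic step not shown above:

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [11] * 4,
    "special_date": pd.to_datetime(
        ["2019-04-06", "2019-04-09", "2019-06-03", "2019-03-11"]),
})
df2 = pd.DataFrame({
    "ID": [11] * 4,
    "first_date": pd.to_datetime(
        ["2019-04-03", "2019-05-02", "2019-05-20", "2019-03-03"]),
    "second_date": pd.to_datetime(
        ["2019-04-09", "2019-05-14", "2019-06-05", "2019-03-07"]),
})

df_out = (
    df1.merge(df2, on="ID", how="right")                 # all pairs per ID
       .query("(special_date >= first_date) & (special_date <= second_date)")
       .groupby(["ID", "first_date", "second_date"],
                as_index=False)["special_date"].max()    # largest per interval
       .merge(df2, on=["ID", "first_date", "second_date"],
              how="right")                               # restore empty intervals
)
# Reorder columns to match the layout in the question.
df_out = df_out[["ID", "first_date", "special_date", "second_date"]]
```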
Upvotes: 0
Reputation: 195643
Try:
# convert columns if necessary:
df1["special_date"] = pd.to_datetime(df1["special_date"])
df2["first_date"] = pd.to_datetime(df2["first_date"])
df2["second_date"] = pd.to_datetime(df2["second_date"])
df2["tmp"] = df2.apply(
    lambda x: pd.date_range(x["first_date"], x["second_date"]), axis=1
)
df2 = (
    df2.explode("tmp")
       .merge(df1, left_on=["ID", "tmp"], right_on=["ID", "special_date"], how="outer")
       .drop(columns="tmp")
)
print(df2.groupby(["ID", "first_date", "second_date"], as_index=False).max())
Prints:
   ID first_date second_date special_date
0  11 2019-03-03  2019-03-07          NaT
1  11 2019-04-03  2019-04-09   2019-04-09
2  11 2019-05-02  2019-05-14          NaT
3  11 2019-05-20  2019-06-05   2019-06-03
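One thing to keep in mind with this approach: the explode step materializes one row per calendar day of each interval, so its size grows with the sum of the interval widths; for the question's data that is 7 + 13 + 17 + 5 = 42 rows. A quick standalone check (assuming pandas >= 0.25 for explode):

```python
import pandas as pd

df2 = pd.DataFrame({
    "ID": [11] * 4,
    "first_date": pd.to_datetime(
        ["2019-04-03", "2019-05-02", "2019-05-20", "2019-03-03"]),
    "second_date": pd.to_datetime(
        ["2019-04-09", "2019-05-14", "2019-06-05", "2019-03-07"]),
})

# One DatetimeIndex per row, covering the interval inclusively on both ends.
df2["tmp"] = df2.apply(
    lambda x: pd.date_range(x["first_date"], x["second_date"]), axis=1
)
exploded = df2.explode("tmp")
# Row count equals the total number of days across all intervals.
print(len(exploded))
```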
Upvotes: 0