big_soapy

Reputation: 137

Merge and match with conditions Pandas

I have two simple dataframes. I would like to merge them where special_date >= first_date and special_date <= second_date, keeping the largest special_date that satisfies both conditions.

    ID   |  special_date | 
0   11   |   2019-04-06  |  
1   11   |   2019-04-09  |  
2   11   |   2019-06-03  |  
3   11   |   2019-03-11  |  

    ID   |   first_date  |  second_date |
0   11   |   2019-04-03  |  2019-04-09  |
1   11   |   2019-05-02  |  2019-05-14  |
2   11   |   2019-05-20  |  2019-06-05  |
3   11   |   2019-03-03  |  2019-03-07  |

Desired output:

    ID   |   first_date  | special_date |  second_date |
0   11   |   2019-04-03  |  2019-04-09  |  2019-04-09  |
1   11   |   2019-05-02  |      NaN     |  2019-05-14  |
2   11   |   2019-05-20  |  2019-06-03  |  2019-06-05  |
3   11   |   2019-03-03  |      NaN     |  2019-03-07  |
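For reproducibility, the example frames can be built like this (a sketch; the dates are assumed to start out as strings and need converting to datetime):

```python
import pandas as pd

# Reconstruction of the example dataframes from the question
df1 = pd.DataFrame({
    "ID": [11, 11, 11, 11],
    "special_date": ["2019-04-06", "2019-04-09", "2019-06-03", "2019-03-11"],
})
df2 = pd.DataFrame({
    "ID": [11, 11, 11, 11],
    "first_date": ["2019-04-03", "2019-05-02", "2019-05-20", "2019-03-03"],
    "second_date": ["2019-04-09", "2019-05-14", "2019-06-05", "2019-03-07"],
})

# The columns must be datetime64 for the range comparisons to work
df1["special_date"] = pd.to_datetime(df1["special_date"])
df2[["first_date", "second_date"]] = df2[["first_date", "second_date"]].apply(pd.to_datetime)
```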

Upvotes: 2

Views: 182

Answers (3)

mozway

Reputation: 262634

Here is an approach: merge first, clean up after. Use merge_asof to merge on "first_date", then keep special_date only where it does not exceed "second_date". prerequisite:

df1['special_date'] = pd.to_datetime(df1['special_date'])
df2['first_date'] = pd.to_datetime(df2['first_date'])
df2['second_date'] = pd.to_datetime(df2['second_date'])

processing:

df3 = pd.merge_asof(df1.sort_values(by='special_date'),
                    df2.sort_values(by='first_date'),
                    left_on='special_date',
                    right_on='first_date',
                    suffixes=['', '_drop']
                   ).drop(columns='ID_drop')
df3['special_date'] = df3['special_date'].where(df3['special_date']<=df3['second_date'])

output:

   ID special_date first_date second_date
0  11          NaT 2019-03-03  2019-03-07
1  11   2019-04-06 2019-04-03  2019-04-09
2  11   2019-04-09 2019-04-03  2019-04-09
3  11   2019-06-03 2019-05-20  2019-06-05

Upvotes: 1

SeaBean

Reputation: 23237

You can use:

Convert the dates first, if they are not already in datetime format:

df1['special_date'] = pd.to_datetime(df1['special_date'])

df2['first_date'] = pd.to_datetime(df2['first_date'])
df2['second_date'] = pd.to_datetime(df2['second_date'])

Then merge, filter, group, and select the largest possible date with .merge() + .query() + .groupby() + .max(), as follows:

df_out = (df1.merge(df2, on='ID', how='right')
             .query('(special_date >= first_date) & (special_date <= second_date)')
             .groupby(['ID', 'first_date', 'second_date'], as_index=False)['special_date'].max()
             .merge(df2, on=['ID', 'first_date', 'second_date'], how='right')
         )

Result:

print(df_out)

   ID first_date second_date special_date
0  11 2019-04-03  2019-04-09   2019-04-09
1  11 2019-05-02  2019-05-14          NaT
2  11 2019-05-20  2019-06-05   2019-06-03
3  11 2019-03-03  2019-03-07          NaT

Upvotes: 0

Andrej Kesely

Reputation: 195643

Try:

# convert columns if necessary:
df1["special_date"] = pd.to_datetime(df1["special_date"])
df2["first_date"] = pd.to_datetime(df2["first_date"])
df2["second_date"] = pd.to_datetime(df2["second_date"])

# expand each [first_date, second_date] interval into its individual days
df2["tmp"] = df2.apply(
    lambda x: pd.date_range(x["first_date"], x["second_date"]), axis=1
)

df2 = (
    df2.explode("tmp")
    .merge(
        df1, left_on=["ID", "tmp"], right_on=["ID", "special_date"], how="outer"
    )
    .drop(columns="tmp")
)

print(df2.groupby(["ID", "first_date", "second_date"], as_index=False).max())

Prints:

   ID first_date second_date special_date
0  11 2019-03-03  2019-03-07          NaT
1  11 2019-04-03  2019-04-09   2019-04-09
2  11 2019-05-02  2019-05-14          NaT
3  11 2019-05-20  2019-06-05   2019-06-03

Upvotes: 0
