crx91
crx91

Reputation: 483

How to check for interval overlap for grouped item on two dataframes?

I have two dataframes, df1 and df2, df1 has three columns - group, startdate1 and enddate1, and df2 also has three columns, group, startdate2 and enddate2. I'd like to compare for each group in df1, if the interval (startdate1,enddate1) overlaps with any interval of (startdate2,enddate2) for the same group.

I found this post(Is it possible to use Pandas Overlap in a Dataframe?) which used pandas.IntervalIndex.overlaps to check interval overlap. It's very similar to my question, but I'm struggling on how to use groupby for pandas.IntervalIndex.overlaps (or should I use other methods)? Below are some sample data:

df1:

group startdate1 enddate1
A 2017-07-01 2018-06-30
B 2017-07-01 2018-06-30
A 2018-07-01 2019-06-30
B 2019-07-01 2020-06-30

df2:

group startdate2 enddate2
A 2017-05-01 2018-04-30
A 2019-10-01 2020-01-31
B 2017-07-02 2018-06-29
B 2018-07-01 2019-06-30

The expected output is to add a column of 1 or 0 in df1 if there's any interval overlap with df2 for the same group. df_output:

group startdate1 enddate1 flag
A 2017-07-01 2018-06-30 1
B 2017-07-01 2018-06-30 1
A 2018-07-01 2019-06-30 0
B 2019-07-01 2020-06-30 0

Thank you!

Upvotes: 1

Views: 770

Answers (1)

perl
perl

Reputation: 9941

You can make cartesian join within groups, find indexes of records in df1 that overlap by date range with df2, and then add flag by checking if the index of a record is in that list:

ixs = (df1.reset_index().merge(df2, on=['group'])
         .query('(startdate1 < enddate2) & (enddate1 > startdate2)'))['index']

df1.assign(flag=df1.index.isin(ixs).astype(int))

Output:

  group startdate1   enddate1  flag
0     A 2017-07-01 2018-06-30     1
1     B 2017-07-01 2018-06-30     1
2     A 2018-07-01 2019-06-30     0
3     B 2019-07-01 2020-06-30     0

P.S. I'm assuming all dates are in datetime format already, otherwise we need to pd.to_datetime(...) those columns first

Upvotes: 3

Related Questions