Pandas: Flag overlapping dates but exclude certain rows if condition is met

Question

I have a tough problem I just can't solve. I have a millions of rows where I need to flag overlapping dates between the current and previous row. The rows are grouped by 'KEY' and within this grouping I need to flag rows that have 'Date1' which overlaps with the 'Date2' of the previous row.

An overlapping row is where Date1 of the second row is less than Date2 of the previous row, AND Date1 of the second row is greater than or equal to Date1 of previous row.

Simply put: if date1 of the second row falls between date1 and date2 of the previous row, flag both rows as an overlapping row. FYI, on any given row Date1 will never be greater than Date2.

prev row date1 <= second row date1 < prev row date2

The hard part that I can't figure out is that this step needs to be performed sequentially. That is if the second row in this grouping is flagged, then the next row (row 3) in this group will be compared to the first row (in this situation, the first row will also be flagged as overlapping along with row 2).

Here is a dataset:

df = pd.DataFrame({'KEY': ['100000003', '100000009', '100000009', '100000009', '100000009','100000034','100000034', '100000034'], 
              'Date1': [20120506, 20120506, 20120507,20120608,20120620,20120206,20120304,20120405],
              'Date2': [20120528, 20120610, 20120615,20120629,20120621,20120305,20120506,20120506]})
df['Date1'] = pd.to_datetime(df["Date1"], format='%Y%m%d')
df['Date2'] = pd.to_datetime(df["Date2"], format='%Y%m%d')
df.sort_values(by=['KEY','Date1','Date2'], inplace=True)
df[['KEY','Date1','Date2']]

    KEY         Date1       Date2
0   100000003   2012-05-06  2012-05-28
1   100000009   2012-05-06  2012-06-10
2   100000009   2012-05-07  2012-06-15
3   100000009   2012-06-08  2012-06-29
4   100000009   2012-06-20  2012-06-21
5   100000034   2012-02-06  2012-03-05
6   100000034   2012-03-04  2012-05-06
7   100000034   2012-04-05  2012-05-06

Since there are millions of rows and the size of each group varies, I wrote a for loop that will only iterate the max amount of groupby KEYs..

for item in range(df.groupby('KEY')['KEY'].count().max()):
    df['PrevDate1'] = df.groupby('KEY')['Date1'].shift(1)
    df['PrevDate2'] = df.groupby('KEY')['Date2'].shift(1)
    df['Overlapping_Hospitalizations'] = np.where(df['Date1'].between(df['PrevDate1'],df['PrevDate2']),'Y','N')
    print("DONE")
df

This works for each previous KEY, but ALSO I need it to compare to the initial KEY that caused an overlap for that grouping.

Expected Result:

    KEY         Date1       Date2       OverlappingFlag
0   100000003   2012-05-06  2012-05-28     N
1   100000009   2012-05-06  2012-06-10     Y
2   100000009   2012-05-07  2012-06-15     Y
3   100000009   2012-06-08  2012-06-29     Y
4   100000009   2012-06-20  2012-06-21     Y
5   100000034   2012-02-06  2012-03-05     Y
6   100000034   2012-03-04  2012-05-06     Y
7   100000034   2012-04-05  2012-05-06     Y

EDIT: Both the overlapping rows need to be flagged. See Expected Results.

Final Answer:

for item in range(df.groupby('KEY')['KEY'].count().max()):
    df['overlap'] = (((df['KEY'] == df['KEY'].shift())   & \
                  (df['Date1'] >= df['Date1'].shift(1))  & \
                  (df['Date1'] < df['Date2'].shift(1)))  | \
                 ((df['KEY'] == df['KEY'].shift(-1))     & \
                  (df['Date1'].shift(-1) >= df['Date1']) & \
                  (df['Date1'].shift(-1) < df['Date2'])))

Marat · Accepted Answer

It looks like the expected result in the question does not fit the definition:

The rows are grouped by 'KEY' and within this grouping I need to flag rows that have 'Date1' which overlaps with the 'Date2' of the previous row.

    KEY         Date1       Date2       OverlappingFlag
0   100000003   2012-05-06  2012-05-28     N
1   100000009   2012-05-06  2012-06-10     Y  # probably not
2   100000009   2012-05-07  2012-06-15     Y
3   100000009   2012-06-08  2012-06-29     Y
4   100000009   2012-06-20  2012-06-21     Y
5   100000034   2012-02-06  2012-03-05     Y  # probably not
6   100000034   2012-03-04  2012-05-06     Y
7   100000034   2012-04-05  2012-05-06     Y

An extension of @Evan's code for this case:

import pandas as pd
import numpy as np

df = pd.DataFrame({'KEY': ['100000003', '100000009', '100000009', '100000009', '100000009','100000034','100000034', '100000034'], 
              'Date1': [20120506, 20120506, 20120507,20120608,20120620,20120206,20120304,20120405],
              'Date2': [20120528, 20120610, 20120615,20120629,20120621,20120305,20120506,20120506]})
df['Date1'] = pd.to_datetime(df["Date1"], format='%Y%m%d')
df['Date2'] = pd.to_datetime(df["Date2"], format='%Y%m%d')
df.sort_values(by=['KEY','Date1','Date2'], inplace=True)
# if KEY is already an index, df = df.reset_index()
# df.set_index('KEY', inplace = True)

# this is really the only part changed
df['overlap'] = ((df.KEY == df.KEY.shift()) & \
    (df.Date1 < df.Date2.shift())) | \
    ((df.KEY == df.KEY.shift(-1)) & \  
    (df.Date2 < df.Date1.shift(-1)))

df.set_index('KEY', inplace = True)

Pandas: Flag overlapping dates but exclude certain rows if condition is met

Answers (2)

Related Questions