Reputation: 2219
I have a tough problem I just can't solve. I have a millions of rows where I need to flag overlapping dates between the current and previous row. The rows are grouped by 'KEY' and within this grouping I need to flag rows that have 'Date1' which overlaps with the 'Date2' of the previous row.
An overlapping row is where Date1 of the second row is less than Date2 of the previous row, AND Date1 of the second row is greater than or equal to Date1 of previous row.
Simply put: if date1 of the second row falls between date1 and date2 of the previous row, flag both rows as an overlapping row. FYI, on any given row Date1 will never be greater than Date2.
prev row date1 <= second row date1 < prev row date2
The hard part that I can't figure out is that this step needs to be performed sequentially. That is if the second row in this grouping is flagged, then the next row (row 3) in this group will be compared to the first row (in this situation, the first row will also be flagged as overlapping along with row 2).
Here is a dataset:
df = pd.DataFrame({'KEY': ['100000003', '100000009', '100000009', '100000009', '100000009','100000034','100000034', '100000034'],
'Date1': [20120506, 20120506, 20120507,20120608,20120620,20120206,20120304,20120405],
'Date2': [20120528, 20120610, 20120615,20120629,20120621,20120305,20120506,20120506]})
df['Date1'] = pd.to_datetime(df["Date1"], format='%Y%m%d')
df['Date2'] = pd.to_datetime(df["Date2"], format='%Y%m%d')
df.sort_values(by=['KEY','Date1','Date2'], inplace=True)
df[['KEY','Date1','Date2']]
KEY Date1 Date2
0 100000003 2012-05-06 2012-05-28
1 100000009 2012-05-06 2012-06-10
2 100000009 2012-05-07 2012-06-15
3 100000009 2012-06-08 2012-06-29
4 100000009 2012-06-20 2012-06-21
5 100000034 2012-02-06 2012-03-05
6 100000034 2012-03-04 2012-05-06
7 100000034 2012-04-05 2012-05-06
Since there are millions of rows and the size of each group varies, I wrote a for loop that will only iterate the max amount of groupby KEYs..
for item in range(df.groupby('KEY')['KEY'].count().max()):
df['PrevDate1'] = df.groupby('KEY')['Date1'].shift(1)
df['PrevDate2'] = df.groupby('KEY')['Date2'].shift(1)
df['Overlapping_Hospitalizations'] = np.where(df['Date1'].between(df['PrevDate1'],df['PrevDate2']),'Y','N')
print("DONE")
df
This works for each previous KEY, but ALSO I need it to compare to the initial KEY that caused an overlap for that grouping.
Expected Result:
KEY Date1 Date2 OverlappingFlag
0 100000003 2012-05-06 2012-05-28 N
1 100000009 2012-05-06 2012-06-10 Y
2 100000009 2012-05-07 2012-06-15 Y
3 100000009 2012-06-08 2012-06-29 Y
4 100000009 2012-06-20 2012-06-21 Y
5 100000034 2012-02-06 2012-03-05 Y
6 100000034 2012-03-04 2012-05-06 Y
7 100000034 2012-04-05 2012-05-06 Y
EDIT: Both the overlapping rows need to be flagged. See Expected Results.
Final Answer:
for item in range(df.groupby('KEY')['KEY'].count().max()):
df['overlap'] = (((df['KEY'] == df['KEY'].shift()) & \
(df['Date1'] >= df['Date1'].shift(1)) & \
(df['Date1'] < df['Date2'].shift(1))) | \
((df['KEY'] == df['KEY'].shift(-1)) & \
(df['Date1'].shift(-1) >= df['Date1']) & \
(df['Date1'].shift(-1) < df['Date2'])))
Upvotes: 2
Views: 260
Reputation: 15738
It looks like the expected result in the question does not fit the definition:
The rows are grouped by 'KEY' and within this grouping I need to flag rows that have 'Date1' which overlaps with the 'Date2' of the previous row.
KEY Date1 Date2 OverlappingFlag
0 100000003 2012-05-06 2012-05-28 N
1 100000009 2012-05-06 2012-06-10 Y # probably not
2 100000009 2012-05-07 2012-06-15 Y
3 100000009 2012-06-08 2012-06-29 Y
4 100000009 2012-06-20 2012-06-21 Y
5 100000034 2012-02-06 2012-03-05 Y # probably not
6 100000034 2012-03-04 2012-05-06 Y
7 100000034 2012-04-05 2012-05-06 Y
An extension of @Evan's code for this case:
import pandas as pd
import numpy as np
df = pd.DataFrame({'KEY': ['100000003', '100000009', '100000009', '100000009', '100000009','100000034','100000034', '100000034'],
'Date1': [20120506, 20120506, 20120507,20120608,20120620,20120206,20120304,20120405],
'Date2': [20120528, 20120610, 20120615,20120629,20120621,20120305,20120506,20120506]})
df['Date1'] = pd.to_datetime(df["Date1"], format='%Y%m%d')
df['Date2'] = pd.to_datetime(df["Date2"], format='%Y%m%d')
df.sort_values(by=['KEY','Date1','Date2'], inplace=True)
# if KEY is already an index, df = df.reset_index()
# df.set_index('KEY', inplace = True)
# this is really the only part changed
df['overlap'] = ((df.KEY == df.KEY.shift()) & \
(df.Date1 < df.Date2.shift())) | \
((df.KEY == df.KEY.shift(-1)) & \
(df.Date2 < df.Date1.shift(-1)))
df.set_index('KEY', inplace = True)
Upvotes: 1
Reputation: 2151
I'm not totally clear on your logic or how important KEY is; what about this?
import pandas as pd
import numpy as np
df = pd.DataFrame({'KEY': ['100000003', '100000009', '100000009', '100000009', '100000009','100000034','100000034', '100000034'],
'Date1': [20120506, 20120506, 20120507,20120608,20120620,20120206,20120304,20120405],
'Date2': [20120528, 20120610, 20120615,20120629,20120621,20120305,20120506,20120506]})
df['Date1'] = pd.to_datetime(df["Date1"], format='%Y%m%d')
df['Date2'] = pd.to_datetime(df["Date2"], format='%Y%m%d')
df.sort_values(by=['KEY','Date1','Date2'], inplace=True)
df.set_index('KEY', inplace = True)
df['overlap'] = ((df.Date1 > df.Date1.shift()) & \
(df.Date1 < df.Date2.shift())) | \
((df.Date1 < df.Date1.shift(-1)) & \
(df.Date2 < df.Date2.shift(-1)))
Output:
Date1 Date2 overlap
KEY
100000003 2012-05-06 2012-05-28 False
100000009 2012-05-06 2012-06-10 True
100000009 2012-05-07 2012-06-15 True
100000009 2012-06-08 2012-06-29 True
100000009 2012-06-20 2012-06-21 True
100000034 2012-02-06 2012-03-05 True
100000034 2012-03-04 2012-05-06 True
100000034 2012-04-05 2012-05-06 True
Upvotes: 1