Reputation: 65
I have some data in a pandas dataframe that contains a rank column, a start date and an end date. The data is sorted on the rank column lowest to highest (consequently the start/end dates are unordered). I wish to remove every row whose date range overlaps ANY PREVIOUS rows'
By way of a toy example:
Raw Data
Rank Start_Date End_Date
1 1/1/2021 2/1/2021
2 1/15/2021 2/15/2021
3 12/7/2020 1/7/2021
4 5/1/2020 6/1/2020
5 7/10/2020 8/10/2020
6 4/20/2020 5/20/2020
Desired Result
Rank Start_Date End_Date
1 1/1/2021 2/1/2021
4 5/1/2020 6/1/2020
5 7/10/2020 8/10/2020
Explanation: Row 2 is removed because its start overlaps Row 1, Row 3 is removed because its end overlaps Row 1. Row 4 is retained as it doesn’t overlap any previously retained Rows (ie Row 1). Similarly, Row 5 is retained as it doesn’t overlap Row 1 or Row 4. Row 6 is removed because it overlaps with Row 4.
Attempts:
df['overlap'] = np.where((df['start']> df['start'].shift(1)) &
(df['start'] < df['end'].shift(1)),1 ,0)
df['overlap'] = np.where((df['end'] < df['end'].shift(1)) &
(df['end'] > df['start'].shift(1)), 1, df['overlap'])
target = df.iloc[0]
day_diff = abs(target['End_Date'] - df['End_Date'])
day_diff = day_diff.reset_index().sort_values(['End_Date', 'index'])
day_diff.columns = ['old_index', 'End_Date']
non_overlap = day_diff.groupby(day_diff['End_Date'].dt.days // window).first().old_index.values
results = df.iloc[non_overlap]
Upvotes: 3
Views: 2072
Reputation: 28644
Another option, that could help with memory usage, is a combination of IntervalIndex
and a for loop:
Convert dates:
df.Start_Date = df.Start_Date.transform(pd.to_datetime, format='%m/%d/%Y')
df.End_Date = df.End_Date.transform(pd.to_datetime, format='%m/%d/%Y')
Create IntervalIndex:
intervals = pd.IntervalIndex.from_arrays(df.Start_Date,
df.End_Date,
closed='both')
Run a for loop (this avoids broadcasting, which while fast, can be memory intensive, depending on the array size):
index = np.arange(intervals.size)
keep = [] # indices of `df` to be retained
# get rid of the indices where the intervals overlap
for interval in intervals:
keep.append(index[0])
checks = intervals[index].overlaps(intervals[index[0]])
if checks.any():
index = index[~checks]
else:
break
if index.size == 0:
break
df.loc[keep]
Rank Start_Date End_Date
0 1 2021-01-01 2021-02-01
3 4 2020-05-01 2020-06-01
4 5 2020-07-10 2020-08-10
Upvotes: 3
Reputation: 260725
Two intervals overlap if (a) End2>Start1 and (b) Start2<End1:
We can use numpy.triu
to calculate those comparisons with the previous rows only:
a = np.triu(df['End_Date'].values>df['Start_Date'].values[:,None])
b = np.triu(df['Start_Date'].values<df['End_Date'].values[:,None])
The good rows are those that have only True on the diagonal for a&b
df[(a&b).sum(0)==1]
output:
Rank Start_Date End_Date
1 2021-01-01 2021-02-01
4 2020-05-01 2020-06-01
5 2020-07-10 2020-08-10
NB. as it needs to calculate the combination of rows, this method can use a lot of memory when the array becomes large, but it should be fast
Upvotes: 4