Reputation: 829
I have the following dataframe
df = pd.DataFrame({'user':['Dan','Dan','Dan','Dan','Ron'], 'start':['2020-01-01 17:00:00',
'2020-01-01 16:20:00','2020-01-01 17:00:00', '2020-01-01 06:30:00','2020-01-01 17:00:00'],
'end':['2020-01-01 21:00:00', '2020-01-02 01:00:00','2020-01-01 21:15:00',
'2020-01-01 10:00:00','2020-01-01 21:00:00']})
user | start | end |
---|---|---|
Dan | 2020-01-01 17:00:00 | 2020-01-01 21:00:00 |
Dan | 2020-01-01 16:20:00 | 2020-01-02 01:00:00 |
Dan | 2020-01-01 17:00:00 | 2020-01-01 21:15:00 |
Dan | 2020-01-01 06:30:00 | 2020-01-01 10:00:00 |
Ron | 2020-01-01 17:00:00 | 2020-01-01 21:00:00 |
For the same user, I would like to leave just one of the records which have overlapping start-end time intervals (no matter which) for example:
user | start | end |
---|---|---|
Dan | 2020-01-01 17:00:00 | 2020-01-01 21:00:00 |
Dan | 2020-01-01 06:30:00 | 2020-01-01 10:00:00 |
Ron | 2020-01-01 17:00:00 | 2020-01-01 21:00:00 |
My try, for a given user to run:
idx, single_user, single_start, single_end = df.to_records()[0]
a = df.loc[((df['user'] == single_user) & ((df['start'] < single_start) & (df['end']< single_end))
|((df['start']> single_start) &(df['end'] > single_end)))].index.tolist()
a.append(df.iloc[idx].name)
And obtain the result using:
test_df.iloc[a]
There must be a better way, is there a pandas method to tackle it?
Upvotes: 1
Views: 59
Reputation: 261840
Here is a potential approach:
# ensure datetime type for meaningful comparisons
df['end'] = pd.to_datetime(df['end'])
df['start'] = pd.to_datetime(df['start'])
# sort by start time
df = df.sort_values(by='start')
# valid data have a start time greater than the previous end time
# or no previous time (NaT in the shift)
s = df.groupby('user')['end'].shift()
df[df['start'].gt(s)|s.isna()]
Output:
user start end
3 Dan 2020-01-01 06:30:00 2020-01-01 10:00:00
1 Dan 2020-01-01 16:20:00 2020-01-02 01:00:00
4 Ron 2020-01-01 17:00:00 2020-01-01 21:00:00
Upvotes: 1