Reputation: 756
I have login history data from User A for a day. My requirement is that at any point in time the User A can have only one valid login. As in the samples below, the user may have attempted to login successfully multiple times, while his first session was still active. So, any logins that happened during the valid session needs to be flagged as duplicate.
Example 1:
In the first sample data below, while the user was still logged in from 00:12:38
to 01:00:02 (index 0)
, there is another login from the user at 00:55:14
to 01:00:02 (index 1)
.
Similarly, if we compare index 2
and 3
, we can see that the record at index 3
is duplicate login as per requirement.
start_time end_time
0 00:12:38 01:00:02
1 00:55:14 01:00:02
2 01:00:02 01:32:40
3 01:00:02 01:08:40
4 01:41:22 03:56:23
5 18:58:26 19:16:49
6 20:12:37 20:52:49
7 20:55:16 22:02:50
8 22:21:24 22:48:50
9 23:11:30 00:00:00
Expected output:
start_time end_time isDup
0 00:12:38 01:00:02 0
1 00:55:14 01:00:02 1
2 01:00:02 01:32:40 0
3 01:00:02 01:08:40 1
4 01:41:22 03:56:23 0
5 18:58:26 19:16:49 0
6 20:12:37 20:52:49 0
7 20:55:16 22:02:50 0
8 22:21:24 22:48:50 0
9 23:11:30 00:00:00 0
These duplicate records need to be updated to 1 at column isDup
.
Example 2:
Another sample of data as below. Here, while the user was still logged in between 13:36:10
and 13:50:16
, there were 3 additional sessions too that needs to be flagged.
start_time end_time
0 13:32:54 13:32:55
1 13:36:10 13:50:16
2 13:37:54 13:38:14
3 13:46:38 13:46:45
4 13:48:59 13:49:05
5 13:50:16 13:50:20
6 14:03:39 14:03:49
7 15:36:20 15:36:20
8 15:46:47 15:46:47
Expected output:
start_time end_time isDup
0 13:32:54 13:32:55 0
1 13:36:10 13:50:16 0
2 13:37:54 13:38:14 1
3 13:46:38 13:46:45 1
4 13:48:59 13:49:05 1
5 13:50:16 13:50:20 0
6 14:03:39 14:03:49 0
7 15:36:20 15:36:20 0
8 15:46:47 15:46:47 0
What's the efficient way to compare the start time of the current record with previous records?
Upvotes: 8
Views: 1334
Reputation: 71689
Map the time
like values in columns start_time
and end_time
to pandas TimeDelta
objects and subtract 1 seconds
from the 00:00:00
timedelta values in end_time
column.
c = ['start_time', 'end_time']
s, e = df[c].astype(str).apply(pd.to_timedelta).to_numpy().T
e[e == pd.Timedelta(0)] += pd.Timedelta(days=1, seconds=-1)
Then for each pair of start_time
and end_time
in the dataframe df
mark the corresponding duplicate intervals using numpy broadcasting
:
m = (s[:, None] >= s) & (e[:, None] <= e)
np.fill_diagonal(m, False)
df['isDupe'] = (m.any(1) & ~df[c].duplicated(keep=False)).view('i1')
# example 1
start_time end_time isDupe
0 00:12:38 01:00:02 0
1 00:55:14 01:00:02 1
2 01:00:02 01:32:40 0
3 01:00:02 01:08:40 1
4 01:41:22 03:56:23 0
5 18:58:26 19:16:49 0
6 20:12:37 20:52:49 0
7 20:55:16 22:02:50 0
8 22:21:24 22:48:50 0
9 23:11:30 00:00:00 0
# example 2
start_time end_time isDupe
0 13:32:54 13:32:55 0
1 13:36:10 13:50:16 0
2 13:37:54 13:38:14 1
3 13:46:38 13:46:45 1
4 13:48:59 13:49:05 1
5 13:50:16 13:50:20 0
6 14:03:39 14:03:49 0
7 15:36:20 15:36:20 0
8 15:46:47 15:46:47 0
Upvotes: 2
Reputation: 756
Here's my solution to the above question. However, if there are any efficient way, I would be happy to accept it. Thanks!
def getDuplicate(data):
data['check_time'] = data.iloc[-1]['start_time']
data['isDup'] = data.apply(lambda x: 1
if (x['start_time'] <= x['check_time']) & (x['check_time'] < x['end_time'])
else 0
, axis = 1)
return data['isDup'].sum()
limit = 1
df_copy = df.copy()
df['isDup'] = 0
for i, row in df.iterrows():
data = df_copy.iloc[:limit]
isDup = getDuplicate(data)
limit = limit + 1
if isDup > 1:
df.at[i, 'isDup'] = 1
else:
df.at[i, 'isDup'] = 0
Upvotes: 0
Reputation: 26676
Query duplicated()
and change astype to int
df['isDup']=(df['Start time'].duplicated(False)|df['End time'].duplicated(False)).astype(int)
Or did you need
df['isDup']=(df['Start time'].between(df['Start time'].shift(),df['End time'].shift())).astype(int)
Upvotes: 3