Saranya
Saranya

Reputation: 756

Compare current row value to previous row values

I have login history data from User A for a day. My requirement is that at any point in time the User A can have only one valid login. As in the samples below, the user may have attempted to login successfully multiple times, while his first session was still active. So, any logins that happened during the valid session needs to be flagged as duplicate.

Example 1:

In the first sample data below, while the user was still logged in from 00:12:38 to 01:00:02 (index 0), there is another login from the user at 00:55:14 to 01:00:02 (index 1).

Similarly, if we compare index 2 and 3, we can see that the record at index 3 is duplicate login as per requirement.

  start_time  end_time
0   00:12:38  01:00:02
1   00:55:14  01:00:02
2   01:00:02  01:32:40
3   01:00:02  01:08:40
4   01:41:22  03:56:23
5   18:58:26  19:16:49
6   20:12:37  20:52:49
7   20:55:16  22:02:50
8   22:21:24  22:48:50
9   23:11:30  00:00:00

Expected output:

  start_time  end_time   isDup
0   00:12:38  01:00:02       0
1   00:55:14  01:00:02       1
2   01:00:02  01:32:40       0
3   01:00:02  01:08:40       1
4   01:41:22  03:56:23       0
5   18:58:26  19:16:49       0
6   20:12:37  20:52:49       0
7   20:55:16  22:02:50       0
8   22:21:24  22:48:50       0
9   23:11:30  00:00:00       0

These duplicate records need to be updated to 1 at column isDup.


Example 2:

Another sample of data as below. Here, while the user was still logged in between 13:36:10 and 13:50:16, there were 3 additional sessions too that needs to be flagged.

  start_time  end_time
0   13:32:54  13:32:55
1   13:36:10  13:50:16
2   13:37:54  13:38:14
3   13:46:38  13:46:45
4   13:48:59  13:49:05
5   13:50:16  13:50:20
6   14:03:39  14:03:49
7   15:36:20  15:36:20
8   15:46:47  15:46:47

Expected output:

  start_time    end_time    isDup
0   13:32:54    13:32:55    0
1   13:36:10    13:50:16    0
2   13:37:54    13:38:14    1
3   13:46:38    13:46:45    1
4   13:48:59    13:49:05    1
5   13:50:16    13:50:20    0
6   14:03:39    14:03:49    0
7   15:36:20    15:36:20    0
8   15:46:47    15:46:47    0

What's the efficient way to compare the start time of the current record with previous records?

Upvotes: 8

Views: 1334

Answers (3)

Shubham Sharma
Shubham Sharma

Reputation: 71689

Map the time like values in columns start_time and end_time to pandas TimeDelta objects and subtract 1 seconds from the 00:00:00 timedelta values in end_time column.

c = ['start_time', 'end_time']
s, e = df[c].astype(str).apply(pd.to_timedelta).to_numpy().T
e[e == pd.Timedelta(0)] += pd.Timedelta(days=1, seconds=-1)

Then for each pair of start_time and end_time in the dataframe df mark the corresponding duplicate intervals using numpy broadcasting:

m = (s[:, None] >= s) & (e[:, None] <= e)
np.fill_diagonal(m, False)
df['isDupe'] = (m.any(1) & ~df[c].duplicated(keep=False)).view('i1')

# example 1
  start_time  end_time  isDupe
0   00:12:38  01:00:02       0
1   00:55:14  01:00:02       1
2   01:00:02  01:32:40       0
3   01:00:02  01:08:40       1
4   01:41:22  03:56:23       0
5   18:58:26  19:16:49       0
6   20:12:37  20:52:49       0
7   20:55:16  22:02:50       0
8   22:21:24  22:48:50       0
9   23:11:30  00:00:00       0

# example 2
  start_time  end_time  isDupe
0   13:32:54  13:32:55       0
1   13:36:10  13:50:16       0
2   13:37:54  13:38:14       1
3   13:46:38  13:46:45       1
4   13:48:59  13:49:05       1
5   13:50:16  13:50:20       0
6   14:03:39  14:03:49       0
7   15:36:20  15:36:20       0
8   15:46:47  15:46:47       0

Upvotes: 2

Saranya
Saranya

Reputation: 756

Here's my solution to the above question. However, if there are any efficient way, I would be happy to accept it. Thanks!

def getDuplicate(data):
    data['check_time'] = data.iloc[-1]['start_time']
    data['isDup'] = data.apply(lambda x: 1 
                               if (x['start_time'] <= x['check_time']) & (x['check_time'] < x['end_time']) 
                               else 0 
                               , axis = 1)

    return data['isDup'].sum()

limit = 1
df_copy = df.copy()
df['isDup'] = 0

for i, row in df.iterrows():
    data = df_copy.iloc[:limit]
    isDup = getDuplicate(data)
    limit = limit + 1

    if isDup > 1:
        df.at[i, 'isDup'] = 1
    else:
        df.at[i, 'isDup'] = 0

Upvotes: 0

wwnde
wwnde

Reputation: 26676

Query duplicated() and change astype to int

df['isDup']=(df['Start time'].duplicated(False)|df['End time'].duplicated(False)).astype(int)

Or did you need

df['isDup']=(df['Start time'].between(df['Start time'].shift(),df['End time'].shift())).astype(int)

Upvotes: 3

Related Questions