Reputation: 651
The following is the data I have:
id name unused time
1 a 1 2/21/2017 18:01:31.168
1 a 2 2/21/2017 18:01:31.168
1 a 3 2/21/2017 18:11:44.054
1 a 4 2/21/2017 18:19:03.147
1 b 5 2/21/2017 18:19:03.147
1 b 6 2/21/2017 21:55:43.927
1 b 7 2/21/2017 22:10:29.699
1 b 8 2/21/2017 22:10:29.699
2 a 9 2/21/2017 23:36:30.239
2 a 10 2/21/2017 23:45:40.005
2 a 11 2/22/2017 00:05:43.466
2 a 12 2/22/2017 00:05:43.466
2 b 13 2/22/2017 00:16:00.646
2 b 14 2/22/2017 11:43:16.250
2 b 15 2/22/2017 11:43:16.250
2 b 16 2/22/2017 14:02:10.531
I want to group by id and name, look for consecutive rows with the same timestamp, and create a flag for them. For example, the 1st and 2nd rows have the same id, name, and time, so I want a flag of 1 for both; if a row's timestamp is not repeated in a consecutive row, the flag should be 0.
The following is the output I am trying to achieve:
id name unused time flag
1 a 1 2/21/2017 18:01:31.168 1
1 a 2 2/21/2017 18:01:31.168 1
1 a 3 2/21/2017 18:11:44.054 0
1 a 4 2/21/2017 18:19:03.147 0
1 b 5 2/21/2017 18:19:03.147 0
1 b 6 2/21/2017 21:55:43.927 0
1 b 7 2/21/2017 22:10:29.699 1
1 b 8 2/21/2017 22:10:29.699 1
2 a 9 2/21/2017 23:36:30.239 0
2 a 10 2/21/2017 23:45:40.005 0
2 a 11 2/22/2017 00:05:43.466 1
2 a 12 2/22/2017 00:05:43.466 1
2 b 13 2/22/2017 00:16:00.646 0
2 b 14 2/22/2017 11:43:16.250 1
2 b 15 2/22/2017 11:43:16.250 1
2 b 16 2/22/2017 14:02:10.531 0
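For reference, the sample input can be reconstructed as a DataFrame roughly like this (a minimal sketch; I use the variable name data to match the code below, and parsing the timestamps with pd.to_datetime is an assumption):

import pandas as pd

# Rebuild the sample input shown above; the times are parsed as datetimes.
data = pd.DataFrame({
    'id':     [1] * 8 + [2] * 8,
    'name':   list('aaaabbbb') * 2,
    'unused': range(1, 17),
    'time':   pd.to_datetime([
        '2/21/2017 18:01:31.168', '2/21/2017 18:01:31.168',
        '2/21/2017 18:11:44.054', '2/21/2017 18:19:03.147',
        '2/21/2017 18:19:03.147', '2/21/2017 21:55:43.927',
        '2/21/2017 22:10:29.699', '2/21/2017 22:10:29.699',
        '2/21/2017 23:36:30.239', '2/21/2017 23:45:40.005',
        '2/22/2017 00:05:43.466', '2/22/2017 00:05:43.466',
        '2/22/2017 00:16:00.646', '2/22/2017 11:43:16.250',
        '2/22/2017 11:43:16.250', '2/22/2017 14:02:10.531',
    ]),
})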
The following is what I have tried.
First, I sort the values:
data.sort_values(['id', 'name', 'time'])
Then I group it:
data.sort_values(['id', 'name', 'time']).groupby(['id', 'name'])
But I am not able to create the flag after that. I could write a for loop that goes through all the values and checks the condition, but I think there should be a more efficient solution because I need to do this for millions of rows.
Can anybody help me solve this?
Thanks
Upvotes: 1
Views: 341
Reputation: 765
# Build a single string key per row from the columns of interest.
col1 = df1[['id', 'name', 'time']].astype(str).apply('*'.join, axis=1)
# Number each run of consecutive identical keys.
col2 = col1.ne(col1.shift()).cumsum()
# Within each run, set the flag to 1 if the run has at least two rows, otherwise 0.
df1.assign(flag=col2).groupby(col2).apply(lambda dd: dd.assign(flag=1) if len(dd) >= 2 else dd.assign(flag=0))
out:
id name unused time flag
1 a 1 2/21/2017 18:01:31.168 1
1 a 2 2/21/2017 18:01:31.168 1
1 a 3 2/21/2017 18:11:44.054 0
1 a 4 2/21/2017 18:19:03.147 0
1 b 5 2/21/2017 18:19:03.147 0
1 b 6 2/21/2017 21:55:43.927 0
1 b 7 2/21/2017 22:10:29.699 1
1 b 8 2/21/2017 22:10:29.699 1
2 a 9 2/21/2017 23:36:30.239 0
2 a 10 2/21/2017 23:45:40.005 0
2 a 11 2/22/2017 00:05:43.466 1
2 a 12 2/22/2017 00:05:43.466 1
2 b 13 2/22/2017 00:16:00.646 0
2 b 14 2/22/2017 11:43:16.250 1
2 b 15 2/22/2017 11:43:16.250 1
2 b 16 2/22/2017 14:02:10.531 0
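A possibly faster, loop-free variant of the same run-labelling idea (just a sketch, not part of the original answer; it assumes the frame is already sorted by id, name, and time as in the question):

# Build a per-row key, label runs of identical consecutive keys, and
# flag every run whose size is at least 2.
key = df1[['id', 'name', 'time']].astype(str).apply('*'.join, axis=1)
run_id = key.ne(key.shift()).cumsum()
df1['flag'] = (run_id.groupby(run_id).transform('size') >= 2).astype(int)

This avoids the per-group Python lambda, which tends to matter for millions of rows.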
Upvotes: 0
Reputation: 29690
One approach may be to just use shift to compare each row with the one before and the one after on your columns of interest.
# The columns that define a duplicate.
eval_cols = df[['id', 'name', 'time']]
# Flag a row when it matches either its previous or its next row on all three columns.
df['flag'] = ((eval_cols == eval_cols.shift()).all(1) |
              (eval_cols == eval_cols.shift(-1)).all(1)).astype(int)
Demo
>>> ((eval_cols == eval_cols.shift()).all(1) |
...  (eval_cols == eval_cols.shift(-1)).all(1)).astype(int)
0 1
1 1
2 0
3 0
4 0
5 0
6 1
7 1
8 0
9 0
10 1
11 1
12 0
13 1
14 1
15 0
dtype: int32
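As a side note (not from the original answer), the same flags can be produced here with duplicated, since every repeated (id, name, time) combination in this sorted data is also adjacent; the shift comparison only looks at neighbouring rows, whereas duplicated marks all matching rows wherever they occur:

# Flag rows whose (id, name, time) combination occurs more than once.
df['flag'] = df.duplicated(['id', 'name', 'time'], keep=False).astype(int)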
Upvotes: 2