Reputation: 506
I am rather new to Pandas and face a quite complicated problem. As my solution is using many nested iteration-loops, I wonder if there is a faster and more "pandasic" way to do this.
I have a dataframe of events similar to this simplified version:
min sec isDone sessionId
2 40 False 1
2 50 False 1
2 55 False 1
2 58 False 1
3 01 False 1
3 12 True 1
5 0 False 1
5 5 False 1
5 15 False 1
5 30 True 1
5 50 False 1
2 0 False 2
2 10 False 2
2 30 False 2
2 50 True 2
Now I want to add a column, that contains the seconds until the next "True" in the "isDone"-column up to a certain amount of seconds - but only within the same "sessionId". All other values would remain NaN.
For 20 seconds, this would look like this:
min sec isDone sessionId secToDone
2 40 False 1 NaN
2 50 False 1 NaN
2 55 False 1 17
2 58 False 1 14
3 01 False 1 11
3 12 True 1 0
5 0 False 1 NaN
5 5 False 1 NaN
5 15 False 1 15
5 30 True 1 0
5 50 False 1 NaN
2 0 False 2 NaN
2 10 False 2 NaN
2 30 False 2 20
2 50 True 2 0
My solution so far was:
Here's my code so far (iteration over sessionId is missing as I am testing this only for one session at the moment):
def get_preceding(df_dataset,sec=20):
df_done = df_dataset[(df_dataset['isDone'] == True)]
for row in df_done.itertuples():
done_min = getattr(row, 'minute')
done_sec = getattr(row, 'second')
if done_sec < sec:
pre_min = done_min -1
pre_sec = 60 + done_sec - sec
else:
pre_min = done_min
pre_sec = done_sec - sec
for r in df_dataset.loc[((pre_min == df_dataset['minute']) & (pre_sec <= df_dataset['second'])) | ((pre_min < df_dataset['minute'])&(df_dataset['minute'] < done_min)) | ((df_dataset['minute'] == done_min) & (df_dataset['second'] <= done_sec))].itertuples():
if r['minute'] == done_min:
r['secToDone'] = done_sec - r['second']
if r['minute'] < done_min:
r['secToDone'] = 60 - r['second'] + done_sec + ((done_min - r['minute'] - 1)*60)
But this is a lot of iteration and the dataframe is quite big. So my question would be:
Is there a faster and more "pandasic" way to do this?
Upvotes: 0
Views: 536
Reputation: 9946
first, you want to combine minutes and seconds into something reasonable:
df['t'] = df['min'] * 60 + df.sec
min sec isDone sessionId t
0 2 40 False 1 160
1 2 50 False 1 170
2 2 55 False 1 175
3 2 58 False 1 178
then, you want to mark all the times where a True
has occurred:
df['true_t'] = df[df.isDone].t
min sec isDone sessionId t true_t
0 2 40 False 1 160 NaN
1 2 50 False 1 170 NaN
2 2 55 False 1 175 NaN
3 2 58 False 1 178 NaN
4 3 1 False 1 181 NaN
5 3 12 True 1 192 192.0
6 5 0 False 1 300 NaN
now, the magic of groupby:
df['next_true_t'] = df.groupby('sessionId').true_t.bfill()
min sec isDone sessionId t true_t next_true_t
0 2 40 False 1 160 NaN 192.0
1 2 50 False 1 170 NaN 192.0
2 2 55 False 1 175 NaN 192.0
3 2 58 False 1 178 NaN 192.0
4 3 1 False 1 181 NaN 192.0
5 3 12 True 1 192 192.0 192.0
6 5 0 False 1 300 NaN 330.0
7 5 5 False 1 305 NaN 330.0
8 5 15 False 1 315 NaN 330.0
9 5 30 True 1 330 330.0 330.0
10 5 50 False 1 350 NaN NaN
11 2 0 False 2 120 NaN 170.0
12 2 10 False 2 130 NaN 170.0
13 2 30 False 2 150 NaN 170.0
14 2 50 True 2 170 170.0 170.0
now, it's trivial to calculate your diff:
df['diff'] = df.next_true_t - df.t
min sec isDone sessionId t true_t next_true_t diff
0 2 40 False 1 160 NaN 192.0 32.0
1 2 50 False 1 170 NaN 192.0 22.0
2 2 55 False 1 175 NaN 192.0 17.0
3 2 58 False 1 178 NaN 192.0 14.0
4 3 1 False 1 181 NaN 192.0 11.0
5 3 12 True 1 192 192.0 192.0 0.0
6 5 0 False 1 300 NaN 330.0 30.0
7 5 5 False 1 305 NaN 330.0 25.0
8 5 15 False 1 315 NaN 330.0 15.0
9 5 30 True 1 330 330.0 330.0 0.0
10 5 50 False 1 350 NaN NaN NaN
11 2 0 False 2 120 NaN 170.0 50.0
12 2 10 False 2 130 NaN 170.0 40.0
13 2 30 False 2 150 NaN 170.0 20.0
14 2 50 True 2 170 170.0 170.0 0.0
i'll leave it up to you to figure out how you want to omit values based on number of seconds, but it's pretty straightforward.
Upvotes: 1