J_Scholz
J_Scholz

Reputation: 506

Pandas - if condition is met in a row, add values to preceding rows without iteration

I am rather new to Pandas and face a quite complicated problem. As my solution is using many nested iteration-loops, I wonder if there is a faster and more "pandasic" way to do this.

I have a dataframe of events similar to this simplified version:

min  sec  isDone       sessionId
2    40   False        1
2    50   False        1
2    55   False        1
2    58   False        1
3    01   False        1
3    12   True         1
5    0    False        1
5    5    False        1
5    15   False        1
5    30   True         1
5    50   False        1
2    0    False        2
2    10   False        2
2    30   False        2
2    50   True         2

Now I want to add a column, that contains the seconds until the next "True" in the "isDone"-column up to a certain amount of seconds - but only within the same "sessionId". All other values would remain NaN.

For 20 seconds, this would look like this:

min  sec  isDone       sessionId  secToDone
2    40   False        1          NaN
2    50   False        1          NaN
2    55   False        1          17
2    58   False        1          14
3    01   False        1          11
3    12   True         1          0
5    0    False        1          NaN
5    5    False        1          NaN
5    15   False        1          15
5    30   True         1          0
5    50   False        1          NaN
2    0    False        2          NaN
2    10   False        2          NaN
2    30   False        2          20
2    50   True         2          0

My solution so far was:

  1. Iterate over sessionIds and select rows.
  2. Build a second dataframe df_done only with the "True"-values from this selection.
  3. Iterate over this df_done-Dataframe and select the preceding rows within 'sec' seconds.
  4. Iterate over these preceding rows and write values

Here's my code so far (iteration over sessionId is missing as I am testing this only for one session at the moment):

def get_preceding(df_dataset,sec=20):
  df_done = df_dataset[(df_dataset['isDone'] == True)]
  for row in df_done.itertuples():
      done_min = getattr(row, 'minute')
      done_sec = getattr(row, 'second')
      if done_sec < sec:
          pre_min = done_min -1
          pre_sec = 60 + done_sec - sec
      else:
          pre_min = done_min
          pre_sec = done_sec - sec


      for r in df_dataset.loc[((pre_min == df_dataset['minute']) & (pre_sec <= df_dataset['second'])) | ((pre_min < df_dataset['minute'])&(df_dataset['minute'] < done_min)) | ((df_dataset['minute'] == done_min) & (df_dataset['second'] <= done_sec))].itertuples():
          if r['minute'] == done_min:
              r['secToDone'] = done_sec - r['second']
          if r['minute'] < done_min:
              r['secToDone'] = 60 - r['second'] + done_sec + ((done_min - r['minute'] - 1)*60)

But this is a lot of iteration and the dataframe is quite big. So my question would be:

Is there a faster and more "pandasic" way to do this?

Upvotes: 0

Views: 536

Answers (1)

acushner
acushner

Reputation: 9946

first, you want to combine minutes and seconds into something reasonable:

df['t'] = df['min'] * 60 + df.sec

    min  sec  isDone  sessionId    t
0     2   40   False          1  160
1     2   50   False          1  170
2     2   55   False          1  175
3     2   58   False          1  178

then, you want to mark all the times where a True has occurred:

df['true_t'] = df[df.isDone].t

    min  sec  isDone  sessionId    t  true_t
0     2   40   False          1  160     NaN
1     2   50   False          1  170     NaN
2     2   55   False          1  175     NaN
3     2   58   False          1  178     NaN
4     3    1   False          1  181     NaN
5     3   12    True          1  192   192.0
6     5    0   False          1  300     NaN

now, the magic of groupby:

df['next_true_t'] = df.groupby('sessionId').true_t.bfill()

    min  sec  isDone  sessionId    t  true_t  next_true_t
0     2   40   False          1  160     NaN        192.0
1     2   50   False          1  170     NaN        192.0
2     2   55   False          1  175     NaN        192.0
3     2   58   False          1  178     NaN        192.0
4     3    1   False          1  181     NaN        192.0
5     3   12    True          1  192   192.0        192.0
6     5    0   False          1  300     NaN        330.0
7     5    5   False          1  305     NaN        330.0
8     5   15   False          1  315     NaN        330.0
9     5   30    True          1  330   330.0        330.0
10    5   50   False          1  350     NaN          NaN
11    2    0   False          2  120     NaN        170.0
12    2   10   False          2  130     NaN        170.0
13    2   30   False          2  150     NaN        170.0
14    2   50    True          2  170   170.0        170.0

now, it's trivial to calculate your diff:

df['diff'] = df.next_true_t - df.t

    min  sec  isDone  sessionId    t  true_t  next_true_t  diff
0     2   40   False          1  160     NaN        192.0  32.0
1     2   50   False          1  170     NaN        192.0  22.0
2     2   55   False          1  175     NaN        192.0  17.0
3     2   58   False          1  178     NaN        192.0  14.0
4     3    1   False          1  181     NaN        192.0  11.0
5     3   12    True          1  192   192.0        192.0   0.0
6     5    0   False          1  300     NaN        330.0  30.0
7     5    5   False          1  305     NaN        330.0  25.0
8     5   15   False          1  315     NaN        330.0  15.0
9     5   30    True          1  330   330.0        330.0   0.0
10    5   50   False          1  350     NaN          NaN   NaN
11    2    0   False          2  120     NaN        170.0  50.0
12    2   10   False          2  130     NaN        170.0  40.0
13    2   30   False          2  150     NaN        170.0  20.0
14    2   50    True          2  170   170.0        170.0   0.0

i'll leave it up to you to figure out how you want to omit values based on number of seconds, but it's pretty straightforward.

Upvotes: 1

Related Questions