rdk
rdk

Reputation: 449

pandas mark rows between events

Create data

actions = ['Start','Action1','Action2','Pause','Actoin2','Resume','Action1','Finish','Start','Action1','Finish']
start_date = datetime.datetime.strptime('14/10/21 09:00:00', '%d/%m/%y %H:%M:%S')
date_list = [start_date + datetime.timedelta(seconds=x) for x in range(0,11)]
values = [1,1,2,1,2,1,5,1,1,1,1]
df = pd.DataFrame({'ActionType': actions,
                   'Timestamp': date_list,
                   'Value': values})
ActionType Timestamp Value
Start 2021-10-14 09:00:00 1
Action1 2021-10-14 09:00:01 1
Action2 2021-10-14 09:00:02 2
Pause 2021-10-14 09:00:03 1
Action2 2021-10-14 09:00:04 2
Restart 2021-10-14 09:00:05 1
Action1 2021-10-14 09:00:06 5
Finish 2021-10-14 09:00:07 1
Start 2021-10-14 09:00:08 1
Action1 2021-10-14 09:00:09 1
Finish 2021-10-14 09:00:010 1

See how there are two "sessions" taking place. I want to mark each session in a new column.

  1. How do I get the rows in-between Start and Finish rows? (assume sorted)
  2. Similarly, how can I filter-out pauses in the session? For example, to calculate RealTimeElapsed column (or simply make a DuringPause boolean column)

The output should look like this:

output = pd.DataFrame({'Session:': [0,0,0,0,0,0,0,0,1,1,1],
                       'ActionType': actions,
                       'Timestamp': date_list,
                       'RealTimeElapsed': [0,1,2,3,3,3,4,5,0,1,2],
                       'Value': values
                      })
Session ActionType Timestamp RealTimeElapsed Value
0 Start 2021-10-14 09:00:00 0 1
0 Action1 2021-10-14 09:00:01 1 1
0 Action2 2021-10-14 09:00:02 2 1
0 Pause 2021-10-14 09:00:03 3 1
0 Action2 2021-10-14 09:00:04 3 1
0 Resume 2021-10-14 09:00:05 3 1
0 Action1 2021-10-14 09:00:06 4 1
0 Finish 2021-10-14 09:00:07 5 1
1 Start 2021-10-14 09:00:08 0 1
1 Action1 2021-10-14 09:00:09 1 1
1 Finish 2021-10-14 09:00:010 2 1

Already considered:

  1. Looping: this is a poor practice (my data is very large), but let me know if this is the only feasible solution.
  2. Shift: pandas has a shift function, but I only know how to use it for a fixed number of rows, there's no conditional things (e.g. start/finish/pause/resume) I can treat differently
  3. Groupby('Type').unstack() and take differences between times: I can't do this because I need to maintain the Value column

Upvotes: 0

Views: 143

Answers (1)

user16386186
user16386186

Reputation:

My solution is:

df.insert(0,"Session",np.where(df["ActionType"]=="Start",1,0).cumsum()-1)

def ufunc(df):
    a = list()
    for i,j in df.groupby(df.ActionType.isin(["Start","Pause","Resume"]).cumsum()):
        if j.ActionType.iloc[0] == "Start":
            a.extend(np.array(range(len(j))))
        elif j.ActionType.iloc[0] == "Pause":
            a.extend([max(a)+1] * len(j))
        elif j.ActionType.iloc[0] == "Resume":
            a.extend(np.array(range(len(j))) + max(a))
    return a

df.insert(len(df.columns)-1,"RealTimeElapsed",df.groupby("Session").apply(ufunc).explode().values)

df
    Session ActionType  Timestamp   RealTimeElapsed   Value
   0    0   Start    2021-10-14 09:00:00    0            1
   1    0   Action1  2021-10-14 09:00:01    1            1
   2    0   Action2  2021-10-14 09:00:02    2            2
   3    0   Pause    2021-10-14 09:00:03    3            1
   4    0   Actoin3  2021-10-14 09:00:04    3            2
   5    0   Resume   2021-10-14 09:00:05    3            1
   6    0   Action1  2021-10-14 09:00:06    4            5
   7    0   Finish   2021-10-14 09:00:07    5            1
   8    1   Start    2021-10-14 09:00:08    0            1
   9    1   Action1  2021-10-14 09:00:09    1            1
   10   1   Finish   2021-10-14 09:00:10    2            1

Upvotes: 1

Related Questions