user3415167
user3415167

Reputation: 1013

Faster way to add next row's column value to current row within groupby

I'm trying to add next row's column value to current row within groupby.

groupby_event_data = event_data.groupby(["user_id", "course_id", "lecture_id"])
pause_event_data = pd.DataFrame(columns=['real_time', 'next_real_time'])
n = 0
for each_group, eachdf in groupby_event_data:
    idx = 0
    for index, row in eachdf.iterrows():
        if row['event_from']=='pause_video':
            if idx < len(eachdf)-1:
                pause_event_data.loc[n] = [row.real_time, eachdf.iloc[idx+1]['real_time']] 
            else:
                pause_event_data.loc[n] = [row.real_time, None] 
            n = n+1
        idx = idx + 1

For example, with the below sample event_data:

 event_data:
      user_id  course_id  lecture_id  event_from   real_time
    0    a        c1          l1      pause_video  2018-07-24T19:44:25.869279+00:00
    1    a        c1          l1      pause_video  2018-07-24T19:46:25.869279+00:00    
    2    a        c2          l1      pause_video  2018-07-24T19:47:25.869279+00:00
    3    b        c2          l1      pause_video  2018-07-24T19:45:25.869279+00:00
    4    b        c2          l1      pause_video  2018-07-24T19:50:25.869279+00:00
    5    a        c2          l1      play_video   2018-07-24T19:52:25.869279+00:00
    6    a        c1          l1      pause_video  2018-07-24T19:54:25.869279+00:00

I want to get the below pause_event_data:

 pause_event_data:
       real_time                           next_real_time
    0  2018-07-24T19:44:25.869279+00:00   2018-07-24T19:46:25.869279+00:00
    1  2018-07-24T19:46:25.869279+00:00   2018-07-24T19:54:25.869279+00:00
    2  2018-07-24T19:47:25.869279+00:00   2018-07-24T19:52:25.869279+00:00
    3  2018-07-24T19:45:25.869279+00:00   2018-07-24T19:50:25.869279+00:00
    4  2018-07-24T19:50:25.869279+00:00   None
    5  2018-07-24T19:54:25.869279+00:00   None

But as my event_data is large, it's taking a long time.

So I was wondering if there's a faster way to run this.

Thanks for your help in advance!

Upvotes: 1

Views: 36

Answers (2)

jezrael
jezrael

Reputation: 862761

Use DataFrameGroupBy.shift with next processing - first sorting and then filtering only rows with pause_video:

cols = ["user_id", "course_id", "lecture_id"]
event_data['next_real_time'] = event_data.groupby(cols)['real_time'].shift(-1)
mask = event_data['event_from'].eq('pause_video')
event_data = (event_data.sort_values(cols)
                        .loc[mask, ['real_time','next_real_time']])
print (event_data)
                         real_time             next_real_time
0 2018-07-24 19:44:25.869279+00:00 2018-07-24 19:46:25.869279
1 2018-07-24 19:46:25.869279+00:00 2018-07-24 19:54:25.869279
6 2018-07-24 19:54:25.869279+00:00                        NaT
2 2018-07-24 19:47:25.869279+00:00 2018-07-24 19:52:25.869279
3 2018-07-24 19:45:25.869279+00:00 2018-07-24 19:50:25.869279
4 2018-07-24 19:50:25.869279+00:00                        NaT

Original solution:

groupby_event_data = event_data.groupby(["user_id", "course_id", "lecture_id"])
pause_event_data = pd.DataFrame(columns=['real_time', 'next_real_time'])
n = 0
for each_group, eachdf in groupby_event_data:
    idx = 0
    for index, row in eachdf.iterrows():
        if row['event_from']=='pause_video':
            if idx < len(eachdf)-1:
                pause_event_data.loc[n] = [row.real_time, eachdf.iloc[idx+1]['real_time']] 
            else:
                pause_event_data.loc[n] = [row.real_time, None] 
            n = n+1
        idx = idx + 1
print (pause_event_data)
                   real_time             next_real_time
0 2018-07-24 19:44:25.869279 2018-07-24 19:46:25.869279
1 2018-07-24 19:46:25.869279 2018-07-24 19:54:25.869279
2 2018-07-24 19:54:25.869279                        NaT
3 2018-07-24 19:47:25.869279 2018-07-24 19:52:25.869279
4 2018-07-24 19:45:25.869279 2018-07-24 19:50:25.869279
5 2018-07-24 19:50:25.869279                        NaT       

Upvotes: 1

Valdi_Bo
Valdi_Bo

Reputation: 30991

To access cells from previous / next row, the standard method is shift (see the documentation). You can use it also in groupby.

In your case, instead of your loop, you can run:

pause_event_data = event_data[['real_time']].copy()
pause_event_data['next_real_time'] = event_data.groupby(
    ["user_id", "course_id", "lecture_id"]).real_time.shift(-1)

Additional remark: Your data should actually be sorted by real_time (in your sample they aren't).

Upvotes: 2

Related Questions