Reputation: 1013
I'm trying to add next row's column value to current row within groupby
.
groupby_event_data = event_data.groupby(["user_id", "course_id", "lecture_id"])
pause_event_data = pd.DataFrame(columns=['real_time', 'next_real_time'])
n = 0
for each_group, eachdf in groupby_event_data:
idx = 0
for index, row in eachdf.iterrows():
if row['event_from']=='pause_video':
if idx < len(eachdf)-1:
pause_event_data.loc[n] = [row.real_time, eachdf.iloc[idx+1]['real_time']]
else:
pause_event_data.loc[n] = [row.real_time, None]
n = n+1
idx = idx + 1
For example, with the below sample event_data
:
event_data:
user_id course_id lecture_id event_from real_time
0 a c1 l1 pause_video 2018-07-24T19:44:25.869279+00:00
1 a c1 l1 pause_video 2018-07-24T19:46:25.869279+00:00
2 a c2 l1 pause_video 2018-07-24T19:47:25.869279+00:00
3 b c2 l1 pause_video 2018-07-24T19:45:25.869279+00:00
4 b c2 l1 pause_video 2018-07-24T19:50:25.869279+00:00
5 a c2 l1 play_video 2018-07-24T19:52:25.869279+00:00
6 a c1 l1 pause_video 2018-07-24T19:54:25.869279+00:00
I want to get the below pause_event_data
:
pause_event_data:
real_time next_real_time
0 2018-07-24T19:44:25.869279+00:00 2018-07-24T19:46:25.869279+00:00
1 2018-07-24T19:46:25.869279+00:00 2018-07-24T19:54:25.869279+00:00
2 2018-07-24T19:47:25.869279+00:00 2018-07-24T19:52:25.869279+00:00
3 2018-07-24T19:45:25.869279+00:00 2018-07-24T19:50:25.869279+00:00
4 2018-07-24T19:50:25.869279+00:00 None
5 2018-07-24T19:54:25.869279+00:00 None
But as my event_data
is large, it's taking a long time.
So I was wondering if there's a faster way to run this.
Thanks for your help in advance!
Upvotes: 1
Views: 36
Reputation: 862761
Use DataFrameGroupBy.shift
with next processing - first sorting and then filtering only rows with pause_video
:
cols = ["user_id", "course_id", "lecture_id"]
event_data['next_real_time'] = event_data.groupby(cols)['real_time'].shift(-1)
mask = event_data['event_from'].eq('pause_video')
event_data = (event_data.sort_values(cols)
.loc[mask, ['real_time','next_real_time']])
print (event_data)
real_time next_real_time
0 2018-07-24 19:44:25.869279+00:00 2018-07-24 19:46:25.869279
1 2018-07-24 19:46:25.869279+00:00 2018-07-24 19:54:25.869279
6 2018-07-24 19:54:25.869279+00:00 NaT
2 2018-07-24 19:47:25.869279+00:00 2018-07-24 19:52:25.869279
3 2018-07-24 19:45:25.869279+00:00 2018-07-24 19:50:25.869279
4 2018-07-24 19:50:25.869279+00:00 NaT
Original solution:
groupby_event_data = event_data.groupby(["user_id", "course_id", "lecture_id"])
pause_event_data = pd.DataFrame(columns=['real_time', 'next_real_time'])
n = 0
for each_group, eachdf in groupby_event_data:
idx = 0
for index, row in eachdf.iterrows():
if row['event_from']=='pause_video':
if idx < len(eachdf)-1:
pause_event_data.loc[n] = [row.real_time, eachdf.iloc[idx+1]['real_time']]
else:
pause_event_data.loc[n] = [row.real_time, None]
n = n+1
idx = idx + 1
print (pause_event_data)
real_time next_real_time
0 2018-07-24 19:44:25.869279 2018-07-24 19:46:25.869279
1 2018-07-24 19:46:25.869279 2018-07-24 19:54:25.869279
2 2018-07-24 19:54:25.869279 NaT
3 2018-07-24 19:47:25.869279 2018-07-24 19:52:25.869279
4 2018-07-24 19:45:25.869279 2018-07-24 19:50:25.869279
5 2018-07-24 19:50:25.869279 NaT
Upvotes: 1
Reputation: 30991
To access cells from previous / next row, the standard method is shift (see the documentation). You can use it also in groupby.
In your case, instead of your loop, you can run:
pause_event_data = event_data[['real_time']].copy()
pause_event_data['next_real_time'] = event_data.groupby(
["user_id", "course_id", "lecture_id"]).real_time.shift(-1)
Additional remark: Your data should actually be sorted by real_time (in your sample they aren't).
Upvotes: 2