Reputation: 10150
Let's say I have a pandas DataFrame that looks like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"time": [100, 200, 300, 400, 100, 200, 300, np.nan],
                   "correct": [1, 1, 0, 1, 1, 0, 0, 0]})
Printed:
   correct   time
0        1  100.0
1        1  200.0
2        0  300.0
3        1  400.0
4        1  100.0
5        0  200.0
6        0  300.0
7        0    NaN
I want to calculate the mean of time for only the rows following rows where correct is equal to 0. So in the above dataframe I want to calculate the mean of 400, 300, and NaN (which will give 350).
I need to handle NaN values carefully, as well as the edge case where the last row has correct == 0 but there isn't a row following it.
What's the most efficient way to do this in pandas without resorting to looping through the dataframe (my current implementation)?
Upvotes: 2
Views: 966
Reputation: 210852
You can use the shift() method:
In [55]: df.loc[df.correct.shift() == 0, 'time'].mean()
Out[55]: 350.0
Explanation:
In [53]: df.correct.shift()
Out[53]:
0    NaN
1    1.0
2    1.0
3    0.0
4    1.0
5    1.0
6    0.0
7    0.0
Name: correct, dtype: float64
In [54]: df.loc[df.correct.shift() == 0, 'time']
Out[54]:
3    400.0
6    300.0
7      NaN
Name: time, dtype: float64
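For reference, here is the same idea as a standalone script rather than an interactive session. It is only a sketch using the example data from the question. The variable names follow_incorrect, mean_after_incorrect and selected_index are made up for illustration. It also spells out how the two edge cases behave: shift() pushes the last row's correct value off the end, so a trailing 0 selects nothing extra, and mean() skips NaN by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"time": [100, 200, 300, 400, 100, 200, 300, np.nan],
                   "correct": [1, 1, 0, 1, 1, 0, 0, 0]})

# shift() moves every value down one row, so row i sees the "correct"
# value of row i-1. Row 0 has no predecessor and gets NaN, and
# NaN == 0 is False, so the first row is never selected.
follows_incorrect = df["correct"].shift() == 0

# The last row has correct == 0 but nothing follows it; its value is
# shifted off the end of the Series and therefore selects no row.
selected_index = df.index[follows_incorrect].tolist()
print(selected_index)  # [3, 6, 7]

# Series.mean() uses skipna=True by default, so the NaN in row 7 is
# ignored and only 400 and 300 are averaged.
mean_after_incorrect = df.loc[follows_incorrect, "time"].mean()
print(mean_after_incorrect)  # 350.0
```

If no row follows a 0 at all, the selection is empty and mean() returns NaN rather than raising, so the result may need a check with pd.isna() before it is used.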
Upvotes: 5