Simon

Reputation: 10150

Pandas mean of rows following a matching condition

Let's say I have a Pandas dataframe that looks like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({"time": [100,200,300,400,100,200,300,np.nan],
                   "correct": [1,1,0,1,1,0,0,0]})

Printed:

   correct   time
0        1  100.0
1        1  200.0
2        0  300.0
3        1  400.0
4        1  100.0
5        0  200.0
6        0  300.0
7        0    NaN

I want to calculate the mean of time over only the rows that immediately follow a row where correct is equal to 0. In the dataframe above, that means the mean of 400, 300, and NaN, which gives 350.

I need to handle NaN values carefully, as well as the edge case where the last row has correct == 0 but no row follows it.

What's the most efficient way to do this in Pandas without looping through the dataframe (my current implementation)?

Upvotes: 2

Views: 966

Answers (1)

MaxU - stand with Ukraine

Reputation: 210852

You can use the shift() method:

In [55]: df.loc[df.correct.shift() == 0, 'time'].mean()
Out[55]: 350.0

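Explanation: shift() moves the correct column down by one row, so each row holds the previous row's value. Comparing that to 0 selects the rows that follow a row where correct == 0: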

In [53]: df.correct.shift()
Out[53]:
0    NaN
1    1.0
2    1.0
3    0.0
4    1.0
5    1.0
6    0.0
7    0.0
Name: correct, dtype: float64

In [54]: df.loc[df.correct.shift() == 0, 'time']
Out[54]:
3    400.0
6    300.0
7      NaN
Name: time, dtype: float64
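To tie this back to the two concerns in the question, here is a minimal self-contained sketch of the same shift() approach wrapped in a function. The function name mean_time_after_incorrect is only an illustrative choice, not something from pandas. It relies on two behaviours: Series.mean() skips NaN by default (skipna=True), and a trailing row with correct == 0 simply has no following row to select, so it needs no special handling.

```python
import numpy as np
import pandas as pd


def mean_time_after_incorrect(df):
    """Mean of `time` over rows whose preceding row has correct == 0.

    Hypothetical helper name, used here only for illustration.
    """
    # shift() moves `correct` down one row, so each row sees the previous
    # row's value. The first row gets NaN, and NaN == 0 is False, so the
    # first row is never selected.
    prev_incorrect = df["correct"].shift() == 0
    # Series.mean() ignores NaN by default (skipna=True).
    return df.loc[prev_incorrect, "time"].mean()


df = pd.DataFrame({"time": [100, 200, 300, 400, 100, 200, 300, np.nan],
                   "correct": [1, 1, 0, 1, 1, 0, 0, 0]})

result = mean_time_after_incorrect(df)
print(result)  # 350.0
```

Note that if no row follows a correct == 0 row at all, the selection is empty and the mean comes back as NaN, so callers may want to check for that case.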

Upvotes: 5
