ericg
ericg

Reputation: 8722

How can I filter dataframe rows based on a quantile value of a column using groupby?

(There is probably a better way of asking the question, but hopefully this description will make it more clear)

A simplified view of my dataframe, showing 10 random rows, is:

      Duration  starting_station_id  ending_station_id
5163         420                 3077               3018
113379       240                 3019               3056
9730         240                 3047               3074
104058       900                 3034               3042
93110        240                 3055               3029
93144        240                 3016               3014
48999        780                 3005               3024
30905        360                 3019               3025
88132        300                 3022               3048
12673        240                 3075               3031

What I want to do is groupby starting_station_id and ending_station_id and the filter out the rows where the value in the Duration column for a group falls above the .99 quantile.

To do the groupby and quantile computation, I do:

df.groupby( ['starting_station_id', 'ending_station_id'] )[ 'Duration' ].quantile([.99])

and some partial output is:

3005                 3006               0.99     3825.6
                     3007               0.99     1134.0
                     3008               0.99     5968.8
                     3009               0.99     9420.0
                     3010               0.99     1740.0
                     3011               0.99    41856.0
                     3014               0.99    22629.6
                     3016               0.99     1793.4
                     3018               0.99    37466.4

What I believe this is telling me is that for the group ( 3005, 3006 ), the values >= 3825.6 fall into the .99 quantile. So, I want to filter out the rows where the duration value for that group is >= 3825.6. (And then do the same for all of the other groups)

What is the best way to do this?

Upvotes: 1

Views: 429

Answers (1)

arra
arra

Reputation: 136

Try this

 thresholds = df.groupby(['start', 'end'])['x'].quantile(.99)
 mask = (df.Duration.values > thresholds[[(x, y) for x, y in zip(df.start, df.end)]]).values
 out = df[mask]

Upvotes: 1

Related Questions