Reputation: 8722
(There is probably a better way of asking the question, but hopefully this description will make it more clear)
A simplified view of my dataframe, showing 10 random rows, is:
Duration starting_station_id ending_station_id
5163 420 3077 3018
113379 240 3019 3056
9730 240 3047 3074
104058 900 3034 3042
93110 240 3055 3029
93144 240 3016 3014
48999 780 3005 3024
30905 360 3019 3025
88132 300 3022 3048
12673 240 3075 3031
What I want to do is groupby starting_station_id and ending_station_id and the filter out the rows where the value in the Duration column for a group falls above the .99 quantile.
To do the groupby and quantile computation, I do:
df.groupby( ['starting_station_id', 'ending_station_id'] )[ 'Duration' ].quantile([.99])
and some partial output is:
3005 3006 0.99 3825.6
3007 0.99 1134.0
3008 0.99 5968.8
3009 0.99 9420.0
3010 0.99 1740.0
3011 0.99 41856.0
3014 0.99 22629.6
3016 0.99 1793.4
3018 0.99 37466.4
What I believe this is telling me is that for the group ( 3005, 3006 ), the values >= 3825.6 fall into the .99 quantile. So, I want to filter out the rows where the duration value for that group is >= 3825.6. (And then do the same for all of the other groups)
What is the best way to do this?
Upvotes: 1
Views: 429
Reputation: 136
Try this
thresholds = df.groupby(['start', 'end'])['x'].quantile(.99)
mask = (df.Duration.values > thresholds[[(x, y) for x, y in zip(df.start, df.end)]]).values
out = df[mask]
Upvotes: 1