user308827
user308827

Reputation: 22041

Find percentile in pandas dataframe based on groups

Season  Name    value
2001    arkansas    3.497
2002    arkansas    3.0935
2003    arkansas    3.3625
2015    arkansas    3.766
2001    colorado    2.21925
2002    colorado    1.4795
2010    colorado    2.89175
2011    colorado    2.48825
2012    colorado    2.08475
2013    colorado    1.68125
2014    colorado    2.5555
2015    colorado    2.48825

In the dataframe above, I want to identify top and bottom 10 percentile values in column value for each state (arkansas and colorado). How do I do that? I can identify top and bottom percentile for entire value column like so:

np.searchsorted(np.percentile(a, [10, 90]), a))

Upvotes: 1

Views: 6325

Answers (2)

Dave Rosenman
Dave Rosenman

Reputation: 1467

If the variable for your dataframe is df, this should work. I'm not sure what you want your output to look like, but I just created code for a dictionary, where each key is a state. Also, since you have very few values, I used the option "nearest" for the argument interpolation (the default value is interpolation). To see the possible options, check out the documentation for the function here.

import pandas as pd
import numpy as np
df = pd.read_csv('stacktest.csv')
#array of unique state names from the dataframe
states = np.unique(df['Name'])
#empty dictionary 
state_data = dict()
for state in states:
    state_data[state] = np.percentile(df[df['Name'] == state]['value'],[10,90],interpolation = 'nearest')
print(state_data)

enter image description here

Upvotes: 0

cs95
cs95

Reputation: 403218

You can use groupby + quantile:

df.groupby('Name')['value'].quantile([.1, .9])

Name
arkansas  0.1    3.174200
          0.9    3.685300
colorado  0.1    1.620725
          0.9    2.656375
Name: value, dtype: float64

And then call np.searchsorted.


Alternatively, use qcut.

df.groupby('Name').apply(lambda x: 
       pd.qcut(x['value'], [.1, .9]))

Name
arkansas  0     (3.173, 3.685]
          1                NaN
          2     (3.173, 3.685]
          3                NaN
colorado  4      (1.62, 2.656]
          5                NaN
          6                NaN
          7      (1.62, 2.656]
          8      (1.62, 2.656]
          9      (1.62, 2.656]
          10     (1.62, 2.656]
          11     (1.62, 2.656]
Name: value, dtype: object

Upvotes: 3

Related Questions