ah bon

Reputation: 10041

Group by another column and extract top values of one column in Pandas

I want to get the values at several custom percentiles (0.10, 0.20, ..., 0.50) for each type in the following dataframe:

    type    value
0   a   90
1   a   63
2   a   11
3   a   61
4   a   78
5   a   67
6   a   89
7   a   12
8   a   43
9   a   30
10  b   72
11  b   84
12  b   74
13  b   66
14  b   80
15  b   75
16  b   47
17  b   22
18  b   5
19  b   64
20  b   32
21  b   45

I have gotten as far as this step:

df['percentile_rank'] = df.groupby('type').value.rank(pct=True).round(2)

Output:

    type    value   percentile_rank
0   a   90  1.00
1   a   63  0.60
2   a   11  0.10
3   a   61  0.50
4   a   78  0.80
5   a   67  0.70
6   a   89  0.90
7   a   12  0.20
8   a   43  0.40
9   a   30  0.30
10  b   72  0.67
11  b   84  1.00
12  b   74  0.75
13  b   66  0.58
14  b   80  0.92
15  b   75  0.83
16  b   47  0.42
17  b   22  0.17
18  b   5   0.08
19  b   64  0.50
20  b   32  0.25
21  b   45  0.33

But I don't know how to get the expected result, which looks like this:

    type    top10   top20   top30   top40   top50
0     a       89      78      67      63      61
1     b       80      75      72      66      64

In the table above, top10 is the value whose percentile_rank equals 0.90, top20 the value whose percentile_rank equals 0.80, and so on. If no row has exactly that percentile_rank, take the value with the closest one. For example, for top10 of type b I use 80, whose percentile_rank is 0.92.

Thanks in advance for your help.

Update:

Output from Andy L.'s method; notice the NaNs for top55 and top45:

   type  top95  top90  top85  top80  top75  top70  top65  top60  top55  top50  \
0   e  40.82  41.81  41.82  42.35  43.85  44.42  44.99  45.92    NaN  45.94   

   top45  top40  top35  top30  top25  top20  top15  top10   top5  
0    NaN  46.04  46.25  46.45  46.85  47.49  48.55  49.82  52.18 

Output from YOBEN_S's method:

   type   top95  top90  top85   top80  top75  top70  top65  top60  top55  top50  \
0   e   40.704  41.82  41.82  42.326   43.7  44.36  44.94  45.94  45.94  45.94   

   top45  top40   top35  top30  top25   top20   top15   top10    top5  
0  45.94  46.04  46.226  46.42  46.82  47.412  48.412  49.776  52.008  

Upvotes: 0

Views: 1406

Answers (2)

Andy L.

Reputation: 25259

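If you want to use the existing values, I guess you could use pd.cut and a groupby max, as follows:

import pandas as pd

# Bin edges on percentile_rank; right=True makes each bin (lower, upper]
bins = [0, 0.5, 0.6, 0.7, 0.8, 0.99]
labels = ['top50', 'top40', 'top30', 'top20', 'top10']
s = pd.cut(df.percentile_rank, bins=bins, labels=labels, right=True)
# Largest value per (type, bin), with the bins spread into columns
df_out = df.groupby(['type', s]).value.max().unstack()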

Out[57]:
percentile_rank  top50  top40  top30  top20  top10
type
a                   61     63     67     78     89
b                   64     66     72     74     80
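
With the binning above, a bin that contains no percentile_rank comes out as NaN, which is what the question's update shows for top55 and top45. One possible alternative is to look up, for each target, the row whose percentile_rank is closest to it. This is a minimal sketch and not part of the original answer: the `targets` mapping and the loop are assumptions made for illustration, and the ranks are left unrounded.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "type": ["a"] * 10 + ["b"] * 12,
    "value": [90, 63, 11, 61, 78, 67, 89, 12, 43, 30,
              72, 84, 74, 66, 80, 75, 47, 22, 5, 64, 32, 45],
})
df["percentile_rank"] = df.groupby("type")["value"].rank(pct=True)

# Hypothetical mapping from output column name to target percentile rank
targets = {"top10": 0.9, "top20": 0.8, "top30": 0.7, "top40": 0.6, "top50": 0.5}

rows = []
for type_name, group in df.groupby("type"):
    row = {"type": type_name}
    for label, q in targets.items():
        # Index label of the row whose rank is closest to the target
        nearest_idx = (group["percentile_rank"] - q).abs().idxmin()
        row[label] = group.loc[nearest_idx, "value"]
    rows.append(row)

result = pd.DataFrame(rows)
print(result)
```

On the sample data this gives 89, 78, 67, 63, 61 for type a and 80, 75, 72, 66, 64 for type b. Every target always finds some nearest row, so no column is left empty. If two rows are equally close, idxmin keeps the first one it encounters.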

Upvotes: 1

BENY

Reputation: 323306

We could use quantile:

s = df.groupby('type').value.apply(lambda x: x.quantile([0.9, 0.8, 0.7, 0.6, 0.5])).unstack()
Out[64]: 
       0.9   0.8   0.7   0.6   0.5
type                              
a     89.1  80.2  70.3  64.6  62.0
b     79.5  74.8  73.4  69.6  65.0
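
The quantiles above are linearly interpolated, so they can be numbers such as 89.1 that do not occur in the data. If only existing values are wanted, the `interpolation` argument of quantile may help. The following is a sketch under that assumption, not part of the original answer: the `targets` mapping is a hypothetical name, and "lower" is just one of the available options ("higher" and "nearest" also return existing values).

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "type": ["a"] * 10 + ["b"] * 12,
    "value": [90, 63, 11, 61, 78, 67, 89, 12, 43, 30,
              72, 84, 74, 66, 80, 75, 47, 22, 5, 64, 32, 45],
})

# Hypothetical mapping from output column name to quantile
targets = {"top10": 0.9, "top20": 0.8, "top30": 0.7, "top40": 0.6, "top50": 0.5}

# interpolation="lower" takes the data point just below the quantile
# position instead of interpolating between its two neighbours
out = pd.DataFrame({
    label: df.groupby("type")["value"].quantile(q, interpolation="lower")
    for label, q in targets.items()
})
print(out)
```

For type a this returns 89 instead of 89.1 at the 0.9 quantile. Note that quantile positions are not the same thing as the closest percentile_rank, so some cells can differ from the question's expected table (for example, type b gets 75 rather than 80 for top10 with "lower").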

Upvotes: 3
