Reputation: 487

Find top N values within each group

I have a dataset similar to the sample below:

| id | size   | old_a | old_b | new_a | new_b |
|----|--------|-------|-------|-------|-------|
| 6  | small  | 3     | 0     | 21    | 0     |
| 6  | small  | 9     | 0     | 23    | 0     |
| 13 | medium | 3     | 0     | 12    | 0     |
| 13 | medium | 37    | 0     | 20    | 1     |
| 20 | medium | 30    | 0     | 5     | 6     |
| 20 | medium | 12    | 2     | 3     | 0     |
| 12 | small  | 7     | 0     | 2     | 0     |
| 10 | small  | 8     | 0     | 12    | 0     |
| 15 | small  | 19    | 0     | 3     | 0     |
| 15 | small  | 54    | 0     | 8     | 0     |
| 87 | medium | 6     | 0     | 9     | 0     |
| 90 | medium | 11    | 1     | 16    | 0     |
| 90 | medium | 25    | 0     | 4     | 0     |
| 90 | medium | 10    | 0     | 5     | 0     |
| 9  | large  | 8     | 1     | 23    | 0     |
| 9  | large  | 19    | 0     | 2     | 0     |
| 1  | large  | 1     | 0     | 0     | 0     |
| 50 | large  | 34    | 0     | 7     | 0     |

This is the input for above table:

data=[[6,'small',3,0,21,0],[6,'small',9,0,23,0],[13,'medium',3,0,12,0],[13,'medium',37,0,20,1],[20,'medium',30,0,5,6],[20,'medium',12,2,3,0],[12,'small',7,0,2,0],[10,'small',8,0,12,0],[15,'small',19,0,3,0],[15,'small',54,0,8,0],[87,'medium',6,0,9,0],[90,'medium',11,1,16,0],[90,'medium',25,0,4,0],[90,'medium',10,0,5,0],[9,'large',8,1,23,0],[9,'large',19,0,2,0],[1,'large',1,0,0,0],[50,'large',34,0,7,0]]
data= pd.DataFrame(data,columns=['id','size','old_a','old_b','new_a','new_b'])

I want to have an output which will group the dataset on size and would list out top 2 id based on the values of 'new_a' column within each group of size. Since, some of the ids are repeating multiple times, I would want to sum the values of new_a for such ids and then find top 2 values. My final table should look like the one below:

| size   | id | new_a |
|--------|----|-------|
| large  | 9  | 25    |
| large  | 50 | 7     |
| medium | 13 | 32    |
| medium | 90 | 25    |
| small  | 6  | 44    |
| small  | 10 | 12    |

I have tried the below code but it isn't showing top 2 values of new_a for each group within 'size' column.

nlargest = data.groupby(['size','id'])['new_a'].sum().nlargest(2).reset_index()

Upvotes: 1

Answers (3)

Mykola Zotko

Reputation: 17911

You can chain two groupby methods:

data.groupby(['id', 'size'])['new_a'].sum().groupby('size').nlargest(2)\
.droplevel(0).to_frame('new_a').reset_index()

Output:

   id    size  new_a
0   9   large     25
1  50   large      7
2  13  medium     32
3  90  medium     25
4   6   small     44
5  10   small     12

Upvotes: 0

Ch3steR

Reputation: 20679

You can set size, id as the index to avoid double groupby here, and use Series.sum leveraging level parameter.

df.set_index(["size", "id"]).groupby(level=0).apply(
    lambda x: x.sum(level=1).nlargest(2)
).reset_index()

     size  id  new_a
0   large   9     25
1   large  50      7
2  medium  13     32
3  medium  90     25
4   small   6     44
5   small  10     12

Upvotes: 2

Andrej Kesely

Reputation: 195573

print(
    df.groupby('size').apply(
        lambda x: x.groupby('id').sum().nlargest(2, columns='new_a')
    ).reset_index()[['size', 'id', 'new_a']]
)

Prints:

     size  id  new_a
0   large   9     25
1   large  50      7
2  medium  13     32
3  medium  90     25
4   small   6     44
5   small  10     12

Upvotes: 3

Find top N values within each group

Answers (3)

Related Questions