DomCobb
DomCobb

Reputation: 41

Get n smallest values of a subset of a dataframe in Python

I am trying to do something similar to the example worked here at Geeks for Geeks.

In the example there they work through a dataframe of NBA (basketball) players and find the 5 smallest players (by height or weight) in the league.

All of this makes sense and I am able to apply it easily to the dataset I am working with until I want to do some more precise data slicing. For example, I want to find the 3 smallest players on each NBA team.

I would like the data output to be as follows:

Team Player Weight Other Data x Other Data y
Team A Player 1 smallest xxxxxx yyyyyy
Team A Player 2 2nd small xxxxxx yyyyyy
Team A Player 3 3rd small xxxxxx yyyyyy
-------- ----------- ----------- ------------- -------------
Team B Player 1 smallest xxxxxx yyyyyy
Team B Player 2 2nd small xxxxxx yyyyyy
Team B Player 3 3rd small xxxxxx yyyyyy
-------- ----------- ----------- ------------- -------------
Team C Player 1 smallest xxxxxx yyyyyy
Team C Player 2 2nd small xxxxxx yyyyyy
Team C Player 3 3rd small xxxxxx yyyyyy
-------- ----------- ----------- ------------- -------------

I have tried doing both of the follow and neither of the work the way I would like.

  1. df2 = df1.groupby('Team').nsmallest(3, ['Weight'])
  2. df2 = df1.groupby('Team')['Weight'].nsmallest(3)

Any suggestions of how I can do multiple cuts of data? I also want to keep the data in all of the other columns, just not the team / player / weight data.

Upvotes: 1

Views: 1353

Answers (4)

Camilo Martínez M.
Camilo Martínez M.

Reputation: 1565

You can take advantage of the fact that df.groupby('Team')["Weight"].nsmallest(3) already gives you the indices you want in the index columns of that result. So you can just use iloc those indices in the original dataframe.

import pandas as pd
    
# making data frame 
df = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv") 

new_df = df.iloc[df.groupby('Team')["Weight"].nsmallest(3).index.get_level_values(1)]

Result:

>>> new_df.head(5)
                    Name            Team  Number  ... Weight        College     Salary
    318  Dennis Schroder   Atlanta Hawks    17.0  ...  172.0            NaN  1763400.0
    323      Jeff Teague   Atlanta Hawks     0.0  ...  186.0    Wake Forest  8000000.0
    311     Kirk Hinrich   Atlanta Hawks    12.0  ...  190.0         Kansas  2854940.0
    0      Avery Bradley  Boston Celtics     0.0  ...  180.0          Texas  7730337.0
    3        R.J. Hunter  Boston Celtics    28.0  ...  185.0  Georgia State  1148640.0
    
    [5 rows x 9 columns]

You can then set the Team and Name as a Multiindex:

>>> new_df.set_index(["Team", "Name"]).head(6)
                                    Number Position  ...        College     Salary
    Team           Name                              ...                          
    Atlanta Hawks  Dennis Schroder    17.0       PG  ...            NaN  1763400.0
                   Jeff Teague         0.0       PG  ...    Wake Forest  8000000.0
                   Kirk Hinrich       12.0       SG  ...         Kansas  2854940.0
    Boston Celtics Avery Bradley       0.0       PG  ...          Texas  7730337.0
                   R.J. Hunter        28.0       SG  ...  Georgia State  1148640.0
                   Isaiah Thomas       4.0       PG  ...     Washington  6912869.0
    
    [6 rows x 7 columns]

Upvotes: 3

Acccumulation
Acccumulation

Reputation: 3591

If you do

df2 = df1.groupby('Team').apply(lambda x: x.nsmallest(3, ['Weight'])

you'll get an iterable consisting of dataframes that each has the three smallest for a team. You can then append them together or put them in a multi-index dataframe.

Upvotes: 0

RohitM
RohitM

Reputation: 137

import pandas as pd

details = {
    'Team' : ['A', 'A', 'A', 'A', 'B', 'B', 'B','B'],
    'Age' : [23, 21, 22, 19,18,17,20,22],
}
  
# creating a Dataframe object 
df = pd.DataFrame(details)

df=df.sort_values(by=['Team','Age'])
df['tmp']=1

df['seq']=df.groupby('Team')['tmp'].cumsum()
  
#filter out based on Nth basis on seq column
#filter code here

print(df.head())

later on you can concat based on the "smallest" word to seq column and drop tmp column.

Upvotes: 0

ThePyGuy
ThePyGuy

Reputation: 18406

Groupby Team, apply pd.Series.nsmallest on Weight, drop the level at 1, convertint it to frame, and inner merge it back to actual dataframe on Team and Weight

df.groupby('Team')['Weight'].apply(lambda row: row.nsmallest(3)).to_frame('Weight').droplevel(1).merge(df, on=['Team', 'Weight'], how='inner')

             Team  Weight           Name  Number Position   Age Height        College     Salary
0  Boston Celtics   180.0  Avery Bradley     0.0       PG  25.0    6-2          Texas  7730337.0
1  Boston Celtics   185.0    R.J. Hunter    28.0       SG  22.0    6-5  Georgia State  1148640.0
2  Boston Celtics   190.0   Terry Rozier    12.0       PG  22.0    6-2     Louisville  1824360.0

PS: While adding expected output, create the sample from original data, people might get confused from such fabricated expected output which is totally different than the actual data.

Upvotes: 0

Related Questions