Reputation: 41
I am trying to do something similar to the example worked here at Geeks for Geeks.
In the example there they work through a dataframe of NBA (basketball) players and find the 5 smallest players (by height or weight) in the league.
All of this makes sense and I am able to apply it easily to the dataset I am working with until I want to do some more precise data slicing. For example, I want to find the 3 smallest players on each NBA team.
I would like the data output to be as follows:
Team | Player | Weight | Other Data x | Other Data y |
---|---|---|---|---|
Team A | Player 1 | smallest | xxxxxx | yyyyyy |
Team A | Player 2 | 2nd small | xxxxxx | yyyyyy |
Team A | Player 3 | 3rd small | xxxxxx | yyyyyy |
-------- | ----------- | ----------- | ------------- | ------------- |
Team B | Player 1 | smallest | xxxxxx | yyyyyy |
Team B | Player 2 | 2nd small | xxxxxx | yyyyyy |
Team B | Player 3 | 3rd small | xxxxxx | yyyyyy |
-------- | ----------- | ----------- | ------------- | ------------- |
Team C | Player 1 | smallest | xxxxxx | yyyyyy |
Team C | Player 2 | 2nd small | xxxxxx | yyyyyy |
Team C | Player 3 | 3rd small | xxxxxx | yyyyyy |
-------- | ----------- | ----------- | ------------- | ------------- |
I have tried doing both of the follow and neither of the work the way I would like.
df2 = df1.groupby('Team').nsmallest(3, ['Weight'])
df2 = df1.groupby('Team')['Weight'].nsmallest(3)
Any suggestions of how I can do multiple cuts of data? I also want to keep the data in all of the other columns, just not the team / player / weight data.
Upvotes: 1
Views: 1353
Reputation: 1565
You can take advantage of the fact that df.groupby('Team')["Weight"].nsmallest(3)
already gives you the indices you want in the index
columns of that result. So you can just use iloc
those indices in the original dataframe.
import pandas as pd
# making data frame
df = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")
new_df = df.iloc[df.groupby('Team')["Weight"].nsmallest(3).index.get_level_values(1)]
Result:
>>> new_df.head(5)
Name Team Number ... Weight College Salary
318 Dennis Schroder Atlanta Hawks 17.0 ... 172.0 NaN 1763400.0
323 Jeff Teague Atlanta Hawks 0.0 ... 186.0 Wake Forest 8000000.0
311 Kirk Hinrich Atlanta Hawks 12.0 ... 190.0 Kansas 2854940.0
0 Avery Bradley Boston Celtics 0.0 ... 180.0 Texas 7730337.0
3 R.J. Hunter Boston Celtics 28.0 ... 185.0 Georgia State 1148640.0
[5 rows x 9 columns]
You can then set the Team
and Name
as a Multiindex:
>>> new_df.set_index(["Team", "Name"]).head(6)
Number Position ... College Salary
Team Name ...
Atlanta Hawks Dennis Schroder 17.0 PG ... NaN 1763400.0
Jeff Teague 0.0 PG ... Wake Forest 8000000.0
Kirk Hinrich 12.0 SG ... Kansas 2854940.0
Boston Celtics Avery Bradley 0.0 PG ... Texas 7730337.0
R.J. Hunter 28.0 SG ... Georgia State 1148640.0
Isaiah Thomas 4.0 PG ... Washington 6912869.0
[6 rows x 7 columns]
Upvotes: 3
Reputation: 3591
If you do
df2 = df1.groupby('Team').apply(lambda x: x.nsmallest(3, ['Weight'])
you'll get an iterable consisting of dataframes that each has the three smallest for a team. You can then append them together or put them in a multi-index dataframe.
Upvotes: 0
Reputation: 137
import pandas as pd
details = {
'Team' : ['A', 'A', 'A', 'A', 'B', 'B', 'B','B'],
'Age' : [23, 21, 22, 19,18,17,20,22],
}
# creating a Dataframe object
df = pd.DataFrame(details)
df=df.sort_values(by=['Team','Age'])
df['tmp']=1
df['seq']=df.groupby('Team')['tmp'].cumsum()
#filter out based on Nth basis on seq column
#filter code here
print(df.head())
later on you can concat based on the "smallest" word to seq column and drop tmp column.
Upvotes: 0
Reputation: 18406
Groupby Team
, apply pd.Series.nsmallest on Weight
, drop the level at 1, convertint it to frame, and inner merge it back to actual dataframe on Team
and Weight
df.groupby('Team')['Weight'].apply(lambda row: row.nsmallest(3)).to_frame('Weight').droplevel(1).merge(df, on=['Team', 'Weight'], how='inner')
Team Weight Name Number Position Age Height College Salary
0 Boston Celtics 180.0 Avery Bradley 0.0 PG 25.0 6-2 Texas 7730337.0
1 Boston Celtics 185.0 R.J. Hunter 28.0 SG 22.0 6-5 Georgia State 1148640.0
2 Boston Celtics 190.0 Terry Rozier 12.0 PG 22.0 6-2 Louisville 1824360.0
PS: While adding expected output, create the sample from original data, people might get confused from such fabricated expected output which is totally different than the actual data.
Upvotes: 0