Reputation: 800
I am practising on the IMDB dataset and i would like to find the top genres that had the most budget.
Actually that would be useful in situations where a boxplot is needed and the genres are numerous. Thus, minimising them to the most expensive would make the boxplot more clear.
i tried this: df.sort_values(by=["genres","budget"])
but it isn't right.
Upvotes: 3
Views: 1664
Reputation: 862511
If need return all columns:
I think you need sort_values
+ groupby
+ head
:
df=df.sort_values(by=["genres","budget"], ascending=[True, False]).groupby("genres").head(5)
Or nlargest
:
df = df.groupby('genres', group_keys=False).apply(lambda x: x.nlargest(5, "budget"))
If need retun only genres
and budget
columns:
df = df.groupby('genres')["budget"].nlargest(2).reset_index(level=1, drop=True).reset_index()
Samples:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'budget':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'genres':list('aaabbb')})
print (df)
A B C E budget genres
0 a 4 7 5 1 a
1 b 5 8 3 3 a
2 c 4 9 6 5 a
3 d 5 4 9 7 b
4 e 5 2 2 1 b
5 f 4 3 4 0 b
df1=df.sort_values(by=["genres","budget"], ascending=[True, False]).groupby("genres").head(2)
df1 = df.groupby('genres', group_keys=False).apply(lambda x: x.nlargest(2, "budget"))
print (df1)
A B C E budget genres
2 c 4 9 6 5 a
1 b 5 8 3 3 a
3 d 5 4 9 7 b
4 e 5 2 2 1 b
df1=df.groupby('genres')["budget"].nlargest(2).reset_index(level=1, drop=True).reset_index()
print (df1)
genres budget
0 a 5
1 a 3
2 b 7
3 b 1
If need top genres with sum of badget per genres:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'budget':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'genres':list('aabbcc')})
print (df)
A B C E budget genres
0 a 4 7 5 1 a
1 b 5 8 3 3 a
2 c 4 9 6 5 b
3 d 5 4 9 7 b
4 e 5 2 2 1 c
5 f 4 3 4 0 c
df = df.groupby('genres')['budget'].sum().nlargest(2)
print (df)
genres
b 12
a 4
Name: budget, dtype: int64
Detail:
print (df.groupby('genres')['budget'].sum())
genres
a 4
b 12
c 1
Name: budget, dtype: int64
Upvotes: 1