gsa
gsa

Reputation: 800

How to find the top 5 values of a column according to another column?

I am practising on the IMDB dataset and i would like to find the top genres that had the most budget.

Actually that would be useful in situations where a boxplot is needed and the genres are numerous. Thus, minimising them to the most expensive would make the boxplot more clear.

i tried this: df.sort_values(by=["genres","budget"]) 

but it isn't right.

Upvotes: 3

Views: 1664

Answers (1)

jezrael
jezrael

Reputation: 862511

If need return all columns:

I think you need sort_values + groupby + head:

df=df.sort_values(by=["genres","budget"], ascending=[True, False]).groupby("genres").head(5)

Or nlargest:

df = df.groupby('genres', group_keys=False).apply(lambda x: x.nlargest(5, "budget"))

If need retun only genres and budget columns:

df = df.groupby('genres')["budget"].nlargest(2).reset_index(level=1, drop=True).reset_index()

Samples:

df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'budget':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'genres':list('aaabbb')})

print (df)
   A  B  C  E  budget genres
0  a  4  7  5       1      a
1  b  5  8  3       3      a
2  c  4  9  6       5      a
3  d  5  4  9       7      b
4  e  5  2  2       1      b
5  f  4  3  4       0      b

df1=df.sort_values(by=["genres","budget"], ascending=[True, False]).groupby("genres").head(2)

df1 = df.groupby('genres', group_keys=False).apply(lambda x: x.nlargest(2, "budget"))

print (df1)
   A  B  C  E  budget genres
2  c  4  9  6       5      a
1  b  5  8  3       3      a
3  d  5  4  9       7      b
4  e  5  2  2       1      b

df1=df.groupby('genres')["budget"].nlargest(2).reset_index(level=1, drop=True).reset_index()
print (df1)
  genres  budget
0      a       5
1      a       3
2      b       7
3      b       1

---

If need top genres with sum of badget per genres:

df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'budget':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'genres':list('aabbcc')})

print (df)
   A  B  C  E  budget genres
0  a  4  7  5       1      a
1  b  5  8  3       3      a
2  c  4  9  6       5      b
3  d  5  4  9       7      b
4  e  5  2  2       1      c
5  f  4  3  4       0      c

df = df.groupby('genres')['budget'].sum().nlargest(2)
print (df)
genres
b    12
a     4
Name: budget, dtype: int64

Detail:

print (df.groupby('genres')['budget'].sum())
genres
a     4
b    12
c     1
Name: budget, dtype: int64

Upvotes: 1

Related Questions