hiddenfile
hiddenfile

Reputation: 23

pandas groupby include a column in final result

        cast                year    revenue         title
id              
135397  Chris Pratt         2015    1.392446e+09    Jurassic World
135397  Bryce Dallas Howard 2015    1.392446e+09    Jurassic World
135397  Irrfan Khan         2015    1.392446e+09    Jurassic World
135397  Nick Robinson       2015    1.392446e+09    Jurassic World  

Given the above DataFrame, I would like to find the highest earning actors per year (based on the combined revenue of movies they acted in that year). This is what I have so far :

#get the total revenue associated with each cast for each year
f ={'revenue':sum}
#revenue by year for each cast
df_actor_yr = df_actor_yr.groupby(['year', 'cast']).agg(f)
df_actor_yr
year    cast    
1960    Anthony Perkins     2.359350e+08
        Charles Laughton    4.423780e+08
        Fred MacMurray      1.843242e+08
        Jack Kruschen       1.843242e+08
        Jean Simmons        4.423780e+08
        John Gavin          2.359350e+08
        Kirk Douglas        4.423780e+08
        Vera Miles          2.359350e+08
1961    Anthony Quayle      2.108215e+08
        Anthony Quinn       2.108215e+08
        Ben Wright          1.574815e+09
        Betty Lou Gerson    1.574815e+09
        ...

Next to get the highest earning cast member for each year I did the following

df_actor_yr.reset_index(inplace=True)
g ={"revenue" : max }
df_actor_yr = df_actor_yr.groupby('year').agg(g)

df_actor_yr   

        revenue
year    
1960    4.423780e+08
1961    1.574815e+09
1962    5.045914e+08
1963    5.617734e+08
1964    8.780804e+08
1965    1.129535e+09
1967    1.345551e+09
1968    4.187094e+08
1969    6.081511e+08
...

This only gives me the year and maximum revenue for that year.I would also like to get the corresponding name of the cast member associated with the revenue. How do I go about doing this?

Upvotes: 1

Views: 60

Answers (1)

jpp
jpp

Reputation: 164673

You can split your logic into 2 steps. First sum by cast and year using GroupBy + sum. Then find the maximum revenue by year using GroupBy + idxmax:

# sum by cast and year
df_summed = df.groupby(['cast', 'year'])['revenue'].sum().reset_index()

# maximums by year
res = df_summed.loc[df_summed.groupby('year')['revenue'].idxmax()]

print(res)

                cast  year       revenue
3       NickRobinson  2012  3.401340e+09
0  BryceDallasHoward  2015  1.568978e+09

For the above output, I've used more interesting data:

id      cast               year    revenue         title
135397  ChrisPratt         2015    1.392446e+09    JurassicWorld
135397  BryceDallasHoward  2015    1.568978e+09    SomeMovie
135397  IrrfanKhan         2012    1.392446e+09    JurassicWorld
135397  NickRobinson       2012    1.046987e+09    JurassicWorld  
135398  NickRobinson       2012    2.354353e+09    SomeOtherMovie  

Upvotes: 1

Related Questions