Reputation: 1963
I have a dataset of actors and directors and the popularity of the movie that they have worked together in.
print (actors_director_df.head(3))
actor director popularity counter
0 Chris Pratt Colin Trevorrow 32.985763 0
1 Bryce Dallas Howard Colin Trevorrow 32.985763 0
2 Irrfan Khan Colin Trevorrow 32.985763 0
I want to group by using actor and director because a pair can work in more than one film. which I successfully did use below query.
actor_director_grouped = actors_director_df.groupby(['actor','director']) \
.size() \
.reset_index(name='count') \
.sort_values(['count'], ascending=False) \
.head(10)
print (actor_director_grouped)
actor director count
3619 Clint Eastwood Clint Eastwood 14
19272 Woody Allen Woody Allen 12
9606 Johnny Depp Tim Burton 8
But the popularity column goes missing in this DF.
What I want to do is to do a mean of popularity column after groupby and show the mean in front of actor and director along with the count of the number of movies they did together.
i.e. my ideal output would be something like this.
actor director popularity count
3619 Clint Eastwood Clint Eastwood 32.985763 14
19272 Woody Allen Woody Allen 5.1231231 12
9606 Johnny Depp Tim Burton 3.1231231 8
Upvotes: 2
Views: 255
Reputation: 18914
Looking at your dataframe the counter
columns seems unnecessary. Let us instead use the popularity column and make one mean
and one count
column:
import pandas as pd
import numpy as np
np.random.seed(444)
names = [
'Robert Baratheon',
'Jon Snow',
'Daenerys Targaryen',
'Theon Greyjoy',
'Tyrion Lannister'
]
df = pd.DataFrame({
'actor': np.random.choice(names, size=10, p = [0.2,0.2,0.2,0.1,0.3]),
'director': np.random.choice(names, size=10, p = [0.4,0.1,0.1,0.1,0.3]),
'popularity': np.random.randint(0,100, size=10),
'counter': 0
})
df2 = df.groupby(['actor','director'])['popularity']\
.agg(['count', 'mean'])\
.reset_index()\
.sort_values(by='mean', ascending=False)
print(df2)
Returns:
actor director count mean
0 Jon Snow Robert Baratheon 2 53.5
5 Tyrion Lannister Tyrion Lannister 2 49.0
2 Robert Baratheon Tyrion Lannister 2 48.5
1 Robert Baratheon Jon Snow 2 40.5
4 Theon Greyjoy Tyrion Lannister 1 13.0
3 Theon Greyjoy Robert Baratheon 1 7.0
Upvotes: 4
Reputation: 8641
I took to liberty to add some dummy data that would help understand the groupby
clause better.
print(df)
Output:
actor director popularity counter
0 Chris Pratt Colin Trevorrow 32.985763 0
1 Bryce Dallas Howard Colin Trevorrow 32.985763 0
2 Irrfan Khan Colin Trevorrow 32.985763 0
3 Irrfan Khan Colin Trevorrow 60.000000 12
4 Irrfan Khan John Markson 10.000000 10
5 Irrfan Khan Mark Johnson 100.000000 4
Then you need to groupby
on actor
and director
and then find mean
for popularity
and sum
for count
.
g = df.groupby(['actor', 'director'], as_index=False)
count = g.size().reset_index(name='count')
grp = g.agg({'popularity':'mean'})
grp.merge(count)
Output:
actor director popularity count
0 Bryce Dallas Howard Colin Trevorrow 32.985763 1
1 Chris Pratt Colin Trevorrow 32.985763 1
2 Irrfan Khan Colin Trevorrow 46.492881 2
3 Irrfan Khan John Markson 10.000000 1
4 Irrfan Khan Mark Johnson 100.000000 1
Upvotes: 2