Reputation: 393
I am working on a personal project using IMDB Data and have currently exhausted all avenues.
Quick Overview:
So far, I have executed the following:
plt.subplot2grid((2,3),(0,1))
actor_1 = df.pivot_table(index="Actor_1", values="Gross_Earnings", aggfunc='sum').sort_values(ascending=False)
actor_1[:15].sort_values(ascending=True).plot(kind='barh', width=0.7, alpha=0.5, color='red')
ax.tick_params(axis='x', labelsize=60)
ax.tick_params(axis='y', labelsize=60)
plt.xlabel("Gross Earnings")
plt.tight_layout()
plt.show()
This works, but it only returns the top values; not the top values with the additional criteria of > 4 Films.
I have also tried the following:
no_of_films = df.groupby("Actor_1")
name_count_key = df["Actor_1"].value_counts().to_dict()
no_of_films["Films"] = no_of_films["Actor_1"].map(name_count_key)
But it returns the following error: "AttributeError: Cannot access callable attribute 'map' of 'SeriesGroupBy' objects, try using the 'apply' method"
no_of_films = df.groupby("Actor_1")
name_count_key = df["Actor_1"].value_counts().to_dict()
no_of_films["Films"] = no_of_films["Actor_1"].apply(name_count_key)
But it returns the following error: "TypeError: unhashable type: 'dict'"
The group-by function idea was to create a new column called "Films" so count the volume of Films each actor has starred in and then use > 4 but it returns bools and not the actual value.
Director Actor_1 IMDB_Score Gross_Earnings Movie_Title
Andrew Stanton Daryl Sabara 6.6 73058679 John Carter
Sam Raimi J.K. Simmons 6.2 336530303 Spider-Man 3
Nathan Greno Brad Garrett 7.8 200807262 Tangled
Joss Whedon Chris Hemsworth 7.5 458991599 Avengers: Age of Ultron
Is this possible or am I being silly?
Any help would be greatly appreciated.
Thanks,
Adrian
Upvotes: 0
Views: 474
Reputation: 862511
I think you need filter or boolean indexing
with transform
:
print (df)
Director Actor_1 IMDB_Score Gross_Earnings Movie_Title
0 James Cameron CCH Pounder 7.9 760505847 Avatar
1 James Cameron CCH Pounder 7.9 760505847 Avatar1
2 James Cameron CCH Pounder 7.9 760505847 Avatar2
3 James Cameron CCH Pounder 7.9 760505847 Avatar3
4 Gore Verbinski Johnny Depp 7.1 309404152 Pirates
5 Sam Mendes Christoph Waltz 6.8 200074175 Spectre
6 Gore Verbinski Johnny Depp 7.1 309404152 Pirates1
7 Sam Mendes Christoph Waltz 6.8 200074175 Spectre1
8 Christopher Nolan Tom Hardy 8.5 448130642 The
df1 = df.groupby(["Actor_1"]).filter(lambda x: len(x) > 3)
print (df1)
Director Actor_1 IMDB_Score Gross_Earnings Movie_Title
0 James Cameron CCH Pounder 7.9 760505847 Avatar
1 James Cameron CCH Pounder 7.9 760505847 Avatar1
2 James Cameron CCH Pounder 7.9 760505847 Avatar2
3 James Cameron CCH Pounder 7.9 760505847 Avatar3
Or faster solution:
nofilms = df.groupby(["Actor_1"])['Movie_Title'].transform('size')
df1 = df[nofilms > 3]
print (df1)
Director Actor_1 IMDB_Score Gross_Earnings Movie_Title
0 James Cameron CCH Pounder 7.9 760505847 Avatar
1 James Cameron CCH Pounder 7.9 760505847 Avatar1
2 James Cameron CCH Pounder 7.9 760505847 Avatar2
3 James Cameron CCH Pounder 7.9 760505847 Avatar3
Then use groupby
and aggregate mean
:
df2 = df1.groupby('Actor_1')['Gross_Earnings'].mean()
print (df2)
Actor_1
CCH Pounder 760505847
Name: Gross_Earnings, dtype: int64
And last plot by Series.plot.barh
:
df2.plot.barh()
Upvotes: 1