Reputation: 1286
Say I have a dataset with 10000 movies in https://bpaste.net/show/05fa224794e4, with an ercerpt of the dataset being
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Crime|Drama
tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Crime|Thriller
tt0137523 Fight Club (1999) 1999 8.8 458173 139 mins. Drama|Mystery|Thriller
tt0133093 The Matrix (1999) 1999 8.7 448114 136 mins. Action|Adventure|Sci-Fi
tt1375666 Inception (2010) 2010 8.9 385149 148 mins. Action|Adventure|Sci-Fi|Thriller
tt0109830 Forrest Gump (1994) 1994 8.7 368994 142 mins. Comedy|Drama|Romance
tt0169547 American Beauty (1999) 1999 8.6 338332 122 mins. Drama
tt0499549 Avatar (2009) 2009 8.1 336855 162 mins. Action|Adventure|Fantasy|Sci-Fi
tt0108052 Schindler's List (1993) 1993 8.9 325888 195 mins. Biography|Drama|History|War
tt0080684 Star Wars: Episode V - The Empire Strikes Back (1980) 1980 8.8 320105 124 mins. Action|Adventure|Family|Sci-Fi
tt0372784 Batman Begins (2005) 2005 8.3 316613 140 mins. Action|Crime|Drama|Thriller
tt0114814 The Usual Suspects (1995) 1995 8.7 306624 106 mins. Crime|Mystery|Thriller
tt0102926 The Silence of the Lambs (1991) 1991 8.7 293081 118 mins. Crime|Thriller
tt0120338 Titanic (1997) 1997 7.4 284245 194 mins. Adventure|Drama|History|Romance
I have this gist of code that loads my dataset and performs some changes to it
import pandas as pd
import numpy as np
headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']
movies = pd.read_csv("imdb_top_10000.txt", sep="\t", header=None, names=headers, encoding='UTF-8')
movies.head()
one_hot_encoding = movies["genres"].str.get_dummies(sep='|')
movies = pd.concat([movies, one_hot_encoding], axis=1)
movies_top_250 = movies.sort_values('score', ascending=False).head(250)
Given this
I was thinking of a pivot table maybe? Here using only a subset of genre columns.
pd.pivot_table(movies_top_250, values=['votes', 'Action', 'Adult'], index='title', aggfunc=np.sum).sort_values('votes', ascending=False)
Action Adult votes
title
The Shawshank Redemption (1994) 0 0 619479
The Dark Knight (2008) 1 0 555122
Pulp Fiction (1994) 0 0 490065
The Godfather (1972) 0 0 474189
Fight Club (1999) 0 0 458173
The Lord of the Rings: The Fellowship of the Ri... 1 0 451263
The Matrix (1999) 1 0 448114
The Lord of the Rings: The Return of the King (... 1 0 428791
Inception (2010) 1 0 385149
The Lord of the Rings: The Two Towers (2002) 1 0 383113
Forrest Gump (1994) 0 0 368994
But this does not tell which genre has the majority of votes. Also
movies.groupby('genres').score.mean()
Returns something like
genres
Action 5.837500
Action|Adventure 6.152381
Action|Adventure|Animation|Comedy|Family|Fantasy 7.500000
Action|Adventure|Animation|Family|Fantasy|Sci-Fi 6.100000
Action|Adventure|Biography|Crime|History|Western 6.300000
Action|Adventure|Biography|Drama|History 7.700000
So I can't really grasp my head about this. For the first question, I'm thinking of getting something like
Genre mean_score votes_sum
Action 7.837500 103237
Adventure 6.152381 103226
Animation 5.500000 103275
Upvotes: 1
Views: 856
Reputation: 251
You can use this oneline solution (escaping linebreaks for pretty format only):
movies = \
(movies.set_index(mv.columns.drop('genres',1).tolist())
.genres.str.split('|',expand=True)
.stack()
.reset_index()
.rename(columns={0:'genre'})
.loc[:,['genre','score','votes']]
.groupby('genre').agg({'score':['mean'], 'votes':['sum']})
)
score votes
mean sum
genre
Action 8.425714 7912508
Adventure 8.430000 7460632
Animation 8.293333 1769806
Biography 8.393750 2112875
Comedy 8.341509 3166269
...
The main issue is the multiple True
values resultant from the one_hot_encoding
process over genres. A single movie can be assigned to 1 or more genres. Hence, you cannot use aggregation methods properly by genre. By the other hand, using the genres
field as is will dissolve multiple genders results as you showed in your question:
genres
Action 5.837500
Action|Adventure 6.152381
Action|Adventure|Animation|Comedy|Family|Fantasy 7.500000
Action|Adventure|Animation|Family|Fantasy|Sci-Fi 6.100000
Action|Adventure|Biography|Crime|History|Western 6.300000
Action|Adventure|Biography|Drama|History 7.700000
A workaround is duplicating rows when more than one gender is found. Using a combination of split
with the expand
method set to True
, you can create multiple dataframes and then stack them. For example, a movie with 2 genres will appear in 2 of the resultant dataframes, where each dataframe represent the movies assigned to each genre. Finally, after parsing, you can aggregate by gender with multiple functions. I will explain step by step:
Load data:
import pandas as pd
import numpy as np
headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']
movies = pd.read_csv("imdb_top_10000.txt", sep="\t", header=None, names=headers, encoding='UTF-8')
Notice that you have null values in genres
field:
imdbID title year score votes runtime genres
7917 tt0990404 Chop Shop (2007) 2007 7.2 2104 84 mins. NaN
Since aggregation methods with Pandas will omit rows with any null value, and we have only 1 movie with a null value on this field, can be set manually (checked on Imdb):
movies.loc[movies.genres.isnull(),"genres"] = "Drama"
Now, as you already showed, we need the top 250 movies by score:
movies = movies.sort_values('score', ascending=False).head(250)
Keep only genres field as column, keep other fields as index. This is for ease on working over genres.
movies = movies.set_index(movies.columns.drop('genres',1).tolist())
genres
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Crime|Drama
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. Crime|Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. Western
tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Crime|Thriller
tt0252487 Outrageous Class (1975) 1975 9.0 9823 87 mins. Comedy|Drama
(250, 1)
This will create N dataframes from N iterations of split.
movies = movies.genres.str.split('|',expand=True)
0 \
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Crime
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. Crime
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. Western
tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Crime
tt0252487 Outrageous Class (1975) 1975 9.0 9823 87 mins. Comedy
1 \
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Drama
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. None
tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Thriller
tt0252487 Outrageous Class (1975) 1975 9.0 9823 87 mins. Drama
...
Now you have a unique value of genre for each movie, where a movie can have more than 1 row if more than 1 genre was assigned, you can stack the set of dataframes. Notice that now we have more than 250 rows (662 rows), but are 250 distinct movies.
movies = movies.stack()
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. 0 Crime
1 Drama
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. 0 Crime
1 Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. 0 Western
dtype: object
(662,)
Get suitable data structure before aggregation:
# Multiple index to columns
movies = movies.reset_index()
# Name the new column for genre
movies = movies.rename(columns={0:'genre'})
# Only wanted fields to be aggregated
movies = movies.loc[:,['genre','score','votes']]
genre score votes
0 Crime 9.2 619479
1 Drama 9.2 619479
2 Crime 9.2 474189
3 Drama 9.2 474189
4 Western 9.0 195238
(662, 3)
As you requested, score must be aggregated by mean and votes by sum:
movies = movies.groupby('genres').agg({'score':['mean'], 'votes':['sum']})
score votes
mean sum
genre
Action 8.425714 7912508
Adventure 8.430000 7460632
Animation 8.293333 1769806
Biography 8.393750 2112875
Comedy 8.341509 3166269
(21, 2)
Upvotes: 1
Reputation: 13903
import io
import numpy as np
import pandas as pd
colnames = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']
data_url = 'https://bpaste.net/raw/05fa224794e4'
movies = pd.read_csv(data_url, sep="\t", header=None, names=colnames, encoding='UTF-8', index_col='imdbID')
And a useful function
def arg_nlargest(x, n, use_index=True):
if isinstance(x, pd.Series):
x = x.values
return np.argpartition(-x, n)[:n]
First get the top 250 movies:
top250_iloc = arg_nlargest(movies['score'], 250)
movies250 = movies.iloc[top250_iloc]
Next, we expand the genres from each movie into indicators, as you did
movies250_genre_inds = movies250["genres"].str.get_dummies(sep='|')
The naive way to proceed is to loop over indicator columns, collecting aggregates for each genre.
genre_agg = {}
for genre in movies250_genre_inds.columns:
mask = movies250_genre_inds[genre].astype(bool)
aggregates = movies250.loc[mask].agg({'score': 'mean', 'votes': 'sum'})
genre_agg[genre] = aggregates.tolist()
genre_agg = pd.DataFrame.from_dict(genre_agg, orient='index', columns=['score_mean', 'votes_sum'])
genre3_iloc = arg_nlargest(genre_agg['score_mean'], 3)
genre3 = genre_agg.iloc[genre3_iloc].sort_values('score_mean', ascending=False)
Upvotes: 0