Aggregation on one-hot-encoded dataframes

Question

Say I have a dataset with 10000 movies in https://bpaste.net/show/05fa224794e4, with an ercerpt of the dataset being

tt0111161   The Shawshank Redemption (1994) 1994     9.2    619479  142 mins.   Crime|Drama
tt0110912   Pulp Fiction (1994) 1994     9.0    490065  154 mins.   Crime|Thriller
tt0137523   Fight Club (1999)   1999     8.8    458173  139 mins.   Drama|Mystery|Thriller
tt0133093   The Matrix (1999)   1999     8.7    448114  136 mins.   Action|Adventure|Sci-Fi
tt1375666   Inception (2010)    2010     8.9    385149  148 mins.   Action|Adventure|Sci-Fi|Thriller
tt0109830   Forrest Gump (1994) 1994     8.7    368994  142 mins.   Comedy|Drama|Romance
tt0169547   American Beauty (1999)  1999     8.6    338332  122 mins.   Drama
tt0499549   Avatar (2009)   2009     8.1    336855  162 mins.   Action|Adventure|Fantasy|Sci-Fi
tt0108052   Schindler's List (1993) 1993     8.9    325888  195 mins.   Biography|Drama|History|War
tt0080684   Star Wars: Episode V - The Empire Strikes Back (1980)   1980     8.8    320105  124 mins.   Action|Adventure|Family|Sci-Fi
tt0372784   Batman Begins (2005)    2005     8.3    316613  140 mins.   Action|Crime|Drama|Thriller
tt0114814   The Usual Suspects (1995)   1995     8.7    306624  106 mins.   Crime|Mystery|Thriller
tt0102926   The Silence of the Lambs (1991) 1991     8.7    293081  118 mins.   Crime|Thriller
tt0120338   Titanic (1997)  1997     7.4    284245  194 mins.   Adventure|Drama|History|Romance

I have this gist of code that loads my dataset and performs some changes to it

import pandas as pd
import numpy as np

headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres'] 
movies = pd.read_csv("imdb_top_10000.txt", sep="	", header=None, names=headers, encoding='UTF-8')
movies.head()

one_hot_encoding = movies["genres"].str.get_dummies(sep='|')
movies = pd.concat([movies, one_hot_encoding], axis=1)

movies_top_250 = movies.sort_values('score', ascending=False).head(250)

Given this

We want to find out the best three genres from a pool of the first 250 best rated
Also the average of the films in these genres
Also we want to find out the average score of the three worst rated genres within this pool of films

I was thinking of a pivot table maybe? Here using only a subset of genre columns.

pd.pivot_table(movies_top_250, values=['votes', 'Action', 'Adult'], index='title', aggfunc=np.sum).sort_values('votes', ascending=False)

                                                    Action  Adult   votes
title                                                                    
The Shawshank Redemption (1994)                          0      0  619479
The Dark Knight (2008)                                   1      0  555122
Pulp Fiction (1994)                                      0      0  490065
The Godfather (1972)                                     0      0  474189
Fight Club (1999)                                        0      0  458173
The Lord of the Rings: The Fellowship of the Ri...       1      0  451263
The Matrix (1999)                                        1      0  448114
The Lord of the Rings: The Return of the King (...       1      0  428791
Inception (2010)                                         1      0  385149
The Lord of the Rings: The Two Towers (2002)             1      0  383113
Forrest Gump (1994)                                      0      0  368994

But this does not tell which genre has the majority of votes. Also

movies.groupby('genres').score.mean()

Returns something like

genres
Action                                                   5.837500
Action|Adventure                                         6.152381
Action|Adventure|Animation|Comedy|Family|Fantasy         7.500000
Action|Adventure|Animation|Family|Fantasy|Sci-Fi         6.100000
Action|Adventure|Biography|Crime|History|Western         6.300000
Action|Adventure|Biography|Drama|History                 7.700000

So I can't really grasp my head about this. For the first question, I'm thinking of getting something like

Genre          mean_score  votes_sum
Action           7.837500     103237
Adventure        6.152381     103226
Animation        5.500000     103275

manuelmatas · Accepted Answer

You can use this oneline solution (escaping linebreaks for pretty format only):

movies = \
    (movies.set_index(mv.columns.drop('genres',1).tolist())
    .genres.str.split('|',expand=True)
    .stack()
    .reset_index()
    .rename(columns={0:'genre'})
    .loc[:,['genre','score','votes']]
    .groupby('genre').agg({'score':['mean'], 'votes':['sum']})
    )

              score    votes
               mean      sum
genre                      
Action     8.425714  7912508
Adventure  8.430000  7460632
Animation  8.293333  1769806
Biography  8.393750  2112875
Comedy     8.341509  3166269
...

EXPLANATION:

The main issue is the multiple True values resultant from the one_hot_encoding process over genres. A single movie can be assigned to 1 or more genres. Hence, you cannot use aggregation methods properly by genre. By the other hand, using the genres field as is will dissolve multiple genders results as you showed in your question:

                                                           genres
Action                                                   5.837500
Action|Adventure                                         6.152381
Action|Adventure|Animation|Comedy|Family|Fantasy         7.500000
Action|Adventure|Animation|Family|Fantasy|Sci-Fi         6.100000
Action|Adventure|Biography|Crime|History|Western         6.300000
Action|Adventure|Biography|Drama|History                 7.700000

A workaround is duplicating rows when more than one gender is found. Using a combination of split with the expand method set to True, you can create multiple dataframes and then stack them. For example, a movie with 2 genres will appear in 2 of the resultant dataframes, where each dataframe represent the movies assigned to each genre. Finally, after parsing, you can aggregate by gender with multiple functions. I will explain step by step:

1. Get top 250 movies (by score)

Load data:

import pandas as pd
import numpy as np

headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres'] 
movies = pd.read_csv("imdb_top_10000.txt", sep="	", header=None, names=headers, encoding='UTF-8')

Notice that you have null values in genres field:

             imdbID             title  year  score  votes   runtime genres
7917  tt0990404  Chop Shop (2007)  2007    7.2   2104  84 mins.    NaN

Since aggregation methods with Pandas will omit rows with any null value, and we have only 1 movie with a null value on this field, can be set manually (checked on Imdb):

movies.loc[movies.genres.isnull(),"genres"] = "Drama"

Now, as you already showed, we need the top 250 movies by score:

movies = movies.sort_values('score', ascending=False).head(250)

2. Create a genre field from genres using split with expand

2.1. Set index

Keep only genres field as column, keep other fields as index. This is for ease on working over genres.

movies = movies.set_index(movies.columns.drop('genres',1).tolist())                                                                         
                                                                                     genres
imdbID    title                                 year score votes  runtime                  
tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.     Crime|Drama
tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.     Crime|Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.         Western
tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.  Crime|Thriller
tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.     Comedy|Drama
(250, 1)

2.2. Split by genres

This will create N dataframes from N iterations of split.

movies = movies.genres.str.split('|',expand=True)

                                                                                       0  \
imdbID    title                                 year score votes  runtime              
tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.    Crime   
tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.    Crime   
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.  Western   
tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.    Crime   
tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.    Comedy   

                                                                                    1  \
imdbID    title                                 year score votes  runtime               
tt0111161 The Shawshank Redemption (1994)       1994 9.2   619479 142 mins.     Drama   
tt0068646 The Godfather (1972)                  1972 9.2   474189 175 mins.     Drama   
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0   195238 161 mins.      None   
tt0110912 Pulp Fiction (1994)                   1994 9.0   490065 154 mins.  Thriller   
tt0252487 Outrageous Class (1975)               1975 9.0   9823   87 mins.      Drama   

...

2.3. Stack

Now you have a unique value of genre for each movie, where a movie can have more than 1 row if more than 1 genre was assigned, you can stack the set of dataframes. Notice that now we have more than 250 rows (662 rows), but are 250 distinct movies.

movies = movies.stack()

imdbID     title                                  year  score  votes   runtime     
tt0111161  The Shawshank Redemption (1994)        1994  9.2    619479  142 mins.  0      Crime
                                                                                  1      Drama
tt0068646  The Godfather (1972)                   1972  9.2    474189  175 mins.  0      Crime
                                                                                  1      Drama
tt0060196  The Good, the Bad and the Ugly (1966)  1966  9.0    195238  161 mins.  0    Western
dtype: object
(662,)

3. Parse

Get suitable data structure before aggregation:

# Multiple index to columns
movies = movies.reset_index()

# Name the new column for genre
movies = movies.rename(columns={0:'genre'})

# Only wanted fields to be aggregated
movies = movies.loc[:,['genre','score','votes']]

     genre  score   votes
0    Crime    9.2  619479
1    Drama    9.2  619479
2    Crime    9.2  474189
3    Drama    9.2  474189
4  Western    9.0  195238
(662, 3)

4. Aggregate

As you requested, score must be aggregated by mean and votes by sum:

movies = movies.groupby('genres').agg({'score':['mean'], 'votes':['sum']})

              score    votes
               mean      sum
genre                      
Action     8.425714  7912508
Adventure  8.430000  7460632
Animation  8.293333  1769806
Biography  8.393750  2112875
Comedy     8.341509  3166269
(21, 2)

Aggregation on one-hot-encoded dataframes

Answers (2)