Reputation: 231

How do I place NaN when computing the average rating for each movie in a DataFrame?

I am working with the MovieLens dataset. It consists of two files: a .csv file that contains the movies and another .csv file that contains the ratings given by n users to specific movies.

I did the following in order to get the average rating for each movie in the DF.

ratings_data.groupby('movieId').rating.mean()

However, with that code I get 9724 movies, whereas the main DataFrame has 9742 movies.
I think some movies have not been rated at all. Since I want to add the ratings to the main movies dataset, how would I put NaN in the fields that have no ratings?

Upvotes: 1

Answers (1)

jezrael

Reputation: 863501

Use Series.reindex with the unique movieId values from the movies DataFrame; Series.sort_values is added so the result is ordered by movieId:

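import pandas as pd

movies_data = pd.read_csv('ml-latest-small/movies.csv')
ratings_data = pd.read_csv('ml-latest-small/ratings.csv')

# All movie IDs from the movies file, sorted and de-duplicated
mov = movies_data['movieId'].sort_values().drop_duplicates()
# Mean rating per movie, aligned to every movie ID; unrated movies get NaN
df = ratings_data.groupby('movieId').rating.mean().reindex(mov).reset_index()
print(df)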
      movieId    rating
0           1  3.920930
1           2  3.431818
2           3  3.259615
3           4  2.357143
4           5  3.071429
      ...       ...
9737   193581  4.000000
9738   193583  3.500000
9739   193585  3.500000
9740   193587  3.500000
9741   193609  4.000000

[9742 rows x 2 columns]

# Rows for movies that have no ratings
df1 = df[df['rating'].isna()]
print(df1)
      movieId  rating
816      1076     NaN
2211     2939     NaN
2499     3338     NaN
2587     3456     NaN
3118     4194     NaN
4037     5721     NaN
4506     6668     NaN
4598     6849     NaN
4704     7020     NaN
5020     7792     NaN
5293     8765     NaN
5421    25855     NaN
5452    26085     NaN
5749    30892     NaN
5824    32160     NaN
5837    32371     NaN
5957    34482     NaN
7565    85565     NaN
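
The reindex behaviour can also be checked without downloading the dataset. This is a minimal sketch, assuming two tiny made-up frames (the names movies, ratings and the values are illustrative, not taken from MovieLens); movie 3 deliberately has no ratings:

```python
import pandas as pd

# Hypothetical toy data: movie 3 exists in movies but has no ratings.
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"movieId": [1, 1, 2], "rating": [4.0, 5.0, 3.0]})

# groupby only creates groups for movieIds that occur in ratings,
# so movie 3 is absent from this Series.
means = ratings.groupby("movieId")["rating"].mean()

# reindex aligns the means to every movieId; labels that are
# missing from the Series are filled with NaN.
all_ids = movies["movieId"].sort_values().drop_duplicates()
result = means.reindex(all_ids).reset_index()
print(result)
```

The groupby result has two rows here, while the reindexed result has three, mirroring the 9724 vs. 9742 difference in the question.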

EDIT:

If you need the average rating as a new column in the movies_data DataFrame, use DataFrame.merge with a left join:

movies_data = pd.read_csv('ml-latest-small/movies.csv')
ratings_data = pd.read_csv('ml-latest-small/ratings.csv')

df = ratings_data.groupby('movieId', as_index=False).rating.mean()
print(df)
      movieId    rating
0           1  3.920930
1           2  3.431818
2           3  3.259615
3           4  2.357143
4           5  3.071429
      ...       ...
9719   193581  4.000000
9720   193583  3.500000
9721   193585  3.500000
9722   193587  3.500000
9723   193609  4.000000

[9724 rows x 2 columns]

# Left join keeps every row of movies_data; movies without ratings get NaN
df = movies_data.merge(df, on='movieId', how='left')
print(df)
      movieId                                      title  \
0           1                           Toy Story (1995)   
1           2                             Jumanji (1995)   
2           3                    Grumpier Old Men (1995)   
3           4                   Waiting to Exhale (1995)   
4           5         Father of the Bride Part II (1995)   
      ...                                        ...   
9737   193581  Black Butler: Book of the Atlantic (2017)   
9738   193583               No Game No Life: Zero (2017)   
9739   193585                               Flint (2017)   
9740   193587        Bungo Stray Dogs: Dead Apple (2018)   
9741   193609        Andrew Dice Clay: Dice Rules (1991)   

                                           genres    rating  
0     Adventure|Animation|Children|Comedy|Fantasy  3.920930  
1                      Adventure|Children|Fantasy  3.431818  
2                                  Comedy|Romance  3.259615  
3                            Comedy|Drama|Romance  2.357143  
4                                          Comedy  3.071429  
                                          ...       ...  
9737              Action|Animation|Comedy|Fantasy  4.000000  
9738                     Animation|Comedy|Fantasy  3.500000  
9739                                        Drama  3.500000  
9740                             Action|Animation  3.500000  
9741                                       Comedy  4.000000  

[9742 rows x 4 columns]
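
To see the NaN rows that the left join produces, here is a small sketch on made-up data (the frames movies and ratings and their values are assumptions for illustration only, not MovieLens content):

```python
import pandas as pd

# Hypothetical toy data: movie 3 has no ratings.
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"movieId": [1, 1, 2], "rating": [4.0, 5.0, 3.0]})

# Mean rating per movie, with movieId kept as a regular column.
avg = ratings.groupby("movieId", as_index=False)["rating"].mean()

# how="left" keeps every row of movies; movies with no match in avg
# get NaN in the rating column.
merged = movies.merge(avg, on="movieId", how="left")

# Select the movies that ended up without a rating.
unrated = merged[merged["rating"].isna()]
print(merged)
print(unrated["title"].tolist())
```

With how="inner" (the default for merge), the unrated movie would be dropped instead, which is why the left join is needed here.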

Upvotes: 1
