Reputation: 15
I'm very new to Python and don't know how to proceed. I have a CSV file with over 100k rows in the following structure:
title,genres,rating
Lord of the Rings,Adventure|Animation|Children|Comedy|Fantasy,4.0
Lord of the Rings,Adventure|Animation|Children|Comedy|Fantasy,4.1
Star Wars,Adventure|Animation|Children|Comedy|Fantasy,4.5
Toy Story,Adventure|Animation|Children|Comedy|Fantasy,2.5
.
.
.
I need to find the number of titles in each genre and the average rating of each genre.
I already have the csv.reader set up, but I don't know how to count the titles per genre or compute their average rating.
Thanks for any help!
Upvotes: 0
Views: 467
Reputation: 349
Supposing the file has only these five lines, the code would be the following. It scales to more lines as long as you implement the logic for reading your file (instead of creating a list at the start, as I did):
file = ["title,genres,rating",
        "Lord of the Rings,Adventure|Animation|Children|Comedy|Fantasy,4.0",
        "Lord of the Rings,Adventure|Animation|Children|Comedy|Fantasy,4.1",
        "Star Wars,Adventure|Animation|Children|Comedy|Fantasy,4.5",
        "Toy Story,Adventure|Animation|Children|Comedy|Fantasy,2.5"]
header = True  # First line is a header
dictionaryOfGenres = {}  # Stores the list of ratings for each genre. The average is computed later.
for line in file:
    if header:
        header = False
        continue
    # line[0] is the movie title, line[1] is the "|"-separated list of genres, line[2] is the rating.
    # Note: a plain split(",") breaks on titles that contain commas; csv.reader handles those.
    line = line.strip("\n").split(",")
    genres = line[1].split("|")
    rating = float(line[2])
    for genre in genres:  # A movie has several genres; each one gets this rating appended to its list
        if genre in dictionaryOfGenres:
            dictionaryOfGenres[genre].append(rating)
        else:
            dictionaryOfGenres[genre] = [rating]
averageRating = {}  # Stores the final average for each genre
for genre in dictionaryOfGenres:
    averageRating[genre] = sum(dictionaryOfGenres[genre]) / len(dictionaryOfGenres[genre])
print(averageRating)
I hope you find it helpful :)
OUTPUT OF THIS SCRIPT:
{'Adventure': 3.775, 'Animation': 3.775, 'Children': 3.775, 'Comedy': 3.775, 'Fantasy': 3.775}
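Since the question also asks for a count per genre and mentions csv.reader, here is a minimal sketch of the same idea that reads through the csv module and keeps a running sum and row count per genre instead of a list of ratings. The function name genre_stats and the file name movies.csv are made up for illustration, and the sample rows are the ones from the question, fed through io.StringIO so the snippet runs on its own.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the question. For the real file, pass
# open("movies.csv", newline="") instead (the file name is hypothetical).
sample = io.StringIO(
    "title,genres,rating\n"
    "Lord of the Rings,Adventure|Animation|Children|Comedy|Fantasy,4.0\n"
    "Lord of the Rings,Adventure|Animation|Children|Comedy|Fantasy,4.1\n"
    "Star Wars,Adventure|Animation|Children|Comedy|Fantasy,4.5\n"
    "Toy Story,Adventure|Animation|Children|Comedy|Fantasy,2.5\n"
)


def genre_stats(handle):
    """Return {genre: (row_count, average_rating)} for an open CSV file."""
    totals = defaultdict(float)  # genre -> sum of ratings seen so far
    counts = defaultdict(int)    # genre -> number of rows seen so far
    # DictReader consumes the header line and yields one dict per data row.
    for row in csv.DictReader(handle):
        rating = float(row["rating"])
        for genre in row["genres"].split("|"):
            totals[genre] += rating
            counts[genre] += 1
    return {genre: (counts[genre], totals[genre] / counts[genre]) for genre in counts}


stats = genre_stats(sample)
for genre, (count, average) in sorted(stats.items()):
    print(f"{genre}: {count} rows, average rating {average:.3f}")
```

Keeping only two numbers per genre means memory use does not grow with the number of rows, which may matter for a file with 100k+ lines.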
Upvotes: 0
Reputation: 18466
Split genres on |, explode it, groupby genres, and use agg with size for title and mean for rating.
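import pandas as pd

df = pd.read_csv("movies.csv")  # hypothetical file name; use the path to your CSV
df['genres'] = df['genres'].str.split('|')
df = (df.explode('genres')
        .groupby('genres')[['title', 'rating']]
        .agg({'title': 'size', 'rating': 'mean'})
     )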
OUTPUT:
           title  rating
genres
Adventure      4   3.775
Animation      4   3.775
Children       4   3.775
Comedy         4   3.775
Fantasy        4   3.775
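Note that the title column above counts rows, so a movie that appears on two lines (Lord of the Rings in the sample) is counted twice. If "number of titles" is meant as the number of distinct titles, one possible sketch using only the standard library collects titles into a set per genre. The function name distinct_titles_per_genre is made up for illustration, and the rows are again the sample from the question.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the question; swap in a real file handle as needed.
sample = io.StringIO(
    "title,genres,rating\n"
    "Lord of the Rings,Adventure|Animation|Children|Comedy|Fantasy,4.0\n"
    "Lord of the Rings,Adventure|Animation|Children|Comedy|Fantasy,4.1\n"
    "Star Wars,Adventure|Animation|Children|Comedy|Fantasy,4.5\n"
    "Toy Story,Adventure|Animation|Children|Comedy|Fantasy,2.5\n"
)


def distinct_titles_per_genre(handle):
    """Return {genre: number of distinct titles} for an open CSV file."""
    titles = defaultdict(set)  # genre -> set of titles seen in that genre
    for row in csv.DictReader(handle):
        for genre in row["genres"].split("|"):
            titles[genre].add(row["title"])  # a set ignores repeated titles
    return {genre: len(names) for genre, names in titles.items()}


distinct = distinct_titles_per_genre(sample)
print(distinct)
```

With the sample rows, each genre ends up with three distinct titles rather than four rows.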
Upvotes: 1