Reputation: 83
movie_dataset = {'Avatar': [0.01940156245995175, 0.4812286689419795, 0.9213483146067416], "Pirates of the Caribbean: At World's End": [0.02455894456664483, 0.45051194539249145, 0.898876404494382], 'Spectre': [0.02005646812429373, 0.378839590443686, 0.9887640449438202], ... }
movie_ratings = {'Avatar': 7.9, "Pirates of the Caribbean: At World's End": 7.1, 'Spectre': 6.8, ...}
def distance(movie1, movie2):
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    final_distance = squared_difference ** 0.5
    return final_distance

def predict(unknown, dataset, movie_ratings, k):
    distances = []
    #Looping through all points in the dataset
    for title in dataset:
        movie = dataset[title]
        distance_to_point = distance(movie, unknown)
        #Adding the distance and point associated with that distance
        distances.append([distance_to_point, title])
    distances.sort()
    #Taking only the k closest points
    neighbors = distances[0:k]
    total_rating = 0
    for i in neighbors[1]:
        total_rating += movie_ratings[i] <----- Why is this an error?
    return total_rating / len(neighbors) <----- Why can I not divide by total rating
    #total_rating = 0
    #for i in neighbors:
    #    title = neighbors[1]
    #    total_rating += movie_ratings[title] <----- Why is this not an error?
    #return total_rating / len(neighbors)

print(movie_dataset["Life of Pi"])
print(movie_ratings["Life of Pi"])
print(predict([0.016, 0.300, 1.022], movie_dataset, movie_ratings, 5))
Two questions here. First, why is this an error?
for i in neighbors[1]:
    total_rating += movie_ratings[i]
It seems to be the same as
for i in neighbors:
    title = neighbors[1]
    total_rating += movie_ratings[title]
Second, why can I not divide by len(total_rating)?
Upvotes: 0
Views: 49
Reputation: 61526
Second question first, because it's more straightforward:
Second, why can I not divide by len(total_rating)?
You're trying to compute an average, right? So you want the sum of the ratings divided by the number of ratings?
Okay. So, you're trying to figure out how many ratings there are. What's the rule that tells you that? It seems like you're expecting to count up the ratings from where they are stored. Where are they stored? It is not total_rating; that's where you stored the numerical sum. Where did the ratings come from? They came from looking up the names of movies in the movie_ratings dict. So the ratings were not actually stored at all; there is nothing to measure the len of. Right? Well, not quite. What is the rule that determines the ratings we are adding up? We are looking them up in movie_ratings by title. So how many of them are there? As many as there are titles. Where were the titles stored? They were paired up with distances in the neighbors list. So there are as many titles as there are neighbors (whatever "neighbor" is supposed to mean here; I don't really understand why you called it that). So that is what you want the len() of.
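To make the counting concrete, here is a small sketch. The distances are made-up placeholders (an assumption for illustration, not computed values); the ratings being summed are the three shown in the question. It shows that the running sum is a plain number with no length, while the list of neighbors is the thing that can be counted.

```python
# Placeholder distances (assumed for illustration), paired with titles.
neighbors = [
    [0.1, 'Spectre'],
    [0.2, "Pirates of the Caribbean: At World's End"],
    [0.3, 'Avatar'],
]

# The running sum of the three ratings from the question is a single float.
total_rating = 6.8 + 7.1 + 7.9

# A float has no length, so len(total_rating) raises TypeError.
try:
    len(total_rating)
    len_error = None
except TypeError as exc:
    len_error = exc

# There is one rating per neighbor, so count the neighbors instead.
average = total_rating / len(neighbors)
print(type(len_error).__name__)
print(average)
```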
Onward to fixing the summation.
total_rating = 0
for i in neighbors[1]:
    total_rating += movie_ratings[i]
First, this computes neighbors[1], which will be one of the [distance_to_point, title] pairs that was .append()ed to the list (assuming there are at least two such values, to make the [1] index valid).
Then, the loop iterates over that two-element list: the first time, i is equal to the distance value, and the second time it would be equal to the title. The error occurs on the first pass, because the distance is a float and movie_ratings has no such key (its keys are titles), so the lookup raises a KeyError before the title is ever reached.
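A quick way to check what this loop actually visits is to run it on a small list and record each value of i. The sketch below uses the three ratings from the question and the three distance-title pairs quoted in the other answer; the names seen and error are introduced here purely for illustration.

```python
movie_ratings = {
    'Avatar': 7.9,
    "Pirates of the Caribbean: At World's End": 7.1,
    'Spectre': 6.8,
}
neighbors = [
    [0.08565491616637051, 'Spectre'],
    [0.1946446017955758, "Pirates of the Caribbean: At World's End"],
    [0.20733104650812437, 'Avatar'],
]

seen = []      # every value that i takes before the loop stops
error = None
try:
    # neighbors[1] is ONE pair, so the loop walks over its two items.
    for i in neighbors[1]:
        seen.append(i)
        movie_ratings[i]   # the distance float is not a key in the dict
except KeyError as exc:
    error = exc

print(seen)
print(type(error).__name__)
```

Only the distance ends up in seen: the lookup fails on it, so the loop never gets as far as the title.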
total_rating = 0
for i in neighbors:
    title = neighbors[1]
    total_rating += movie_ratings[title]
This loop makes i take on each of the pairs as a value. The title = neighbors[1] is broken; now we ignore the i value completely and instead always use a specific pair, and also we try to use the pair (which is a list) as a title (we need a string).
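If this version is uncommented and run, it fails too, just with a different exception. A minimal sketch, reusing the three ratings from the question and placeholder distances (assumed values, for illustration only):

```python
movie_ratings = {
    'Avatar': 7.9,
    "Pirates of the Caribbean: At World's End": 7.1,
    'Spectre': 6.8,
}
# Placeholder distances (assumed for illustration).
neighbors = [
    [0.1, 'Spectre'],
    [0.2, "Pirates of the Caribbean: At World's End"],
    [0.3, 'Avatar'],
]

error = None
try:
    for i in neighbors:
        title = neighbors[1]      # a whole [distance, title] list, not a string
        movie_ratings[title]      # a list cannot be used as a dict key
except TypeError as exc:
    error = exc

print(type(error).__name__)
```

A list is not hashable, so Python rejects it as a dictionary key with a TypeError rather than a KeyError.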
What you presumably wanted is:
total_rating = 0
for neighbor in neighbors:
    title = neighbor[1]
    total_rating += movie_ratings[title]
Notice I use a clearer name for the loop variable, to avoid confusion. neighbor is one of the values from the neighbors list, i.e., one of the distance-title pairs. From that, we can get the title, and then from the ratings data and the title, we can get the rating.
I can make it clearer, by using sequence unpacking:
total_rating = 0
for neighbor in neighbors:
    # Named dist rather than distance so it does not shadow the distance() function.
    dist, title = neighbor
    total_rating += movie_ratings[title]
Instead of having to understand the reason for a [1] index, now we label each part of the neighbor value, and then use the one that's relevant for our purpose.
I can make it simpler, by doing the unpacking right away:
total_rating = 0
# dist, not distance, so the distance() function is not shadowed inside predict
for dist, title in neighbors:
    total_rating += movie_ratings[title]
I can make it more elegant, by not trying to explain to Python how to do sums, and just telling it what to sum:
total_rating = sum(movie_ratings[title] for distance, title in neighbors)
This uses a generator expression along with the built-in sum function, which does exactly what it sounds like.
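Putting the pieces together, one possible version of predict might look like the sketch below. It assumes only the three entries visible in the question (the rest of the dataset is elided there), and k=2 is chosen here just for illustration. The unused loop variable is called _dist, a name introduced for this sketch: in a plain for loop inside predict, assigning to a variable named distance would make that name local and break the earlier call to the distance() function.

```python
def distance(movie1, movie2):
    # Euclidean distance between two equal-length feature lists.
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    return squared_difference ** 0.5


def predict(unknown, dataset, movie_ratings, k):
    # Pair every title with its distance to the unknown point.
    distances = [[distance(dataset[title], unknown), title] for title in dataset]
    distances.sort()              # closest first
    neighbors = distances[:k]     # the k closest [distance, title] pairs
    total_rating = sum(movie_ratings[title] for _dist, title in neighbors)
    return total_rating / len(neighbors)


# Only the three entries shown in the question.
movie_dataset = {
    'Avatar': [0.01940156245995175, 0.4812286689419795, 0.9213483146067416],
    "Pirates of the Caribbean: At World's End": [0.02455894456664483, 0.45051194539249145, 0.898876404494382],
    'Spectre': [0.02005646812429373, 0.378839590443686, 0.9887640449438202],
}
movie_ratings = {
    'Avatar': 7.9,
    "Pirates of the Caribbean: At World's End": 7.1,
    'Spectre': 6.8,
}

result = predict([0.016, 0.300, 1.022], movie_dataset, movie_ratings, 2)
print(result)
```

With these three movies and k=2, the two closest are Spectre and Pirates of the Caribbean, so the result is the mean of 6.8 and 7.1, roughly 6.95.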
Upvotes: 1
Reputation: 15268
distances is generated in the form:
[
[0.08565491616637051, 'Spectre'],
[0.1946446017955758, "Pirates of the Caribbean: At World's End"],
[0.20733104650812437, 'Avatar']
]
which is what neighbors is derived from, and the names are in position 1 of each list.
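For reference, a list of this shape can be reproduced with the question's own distance function. This is a sketch that assumes the unknown point is the one passed in the question's final print call, and uses only the three dataset entries shown there; titles_in_order is a name added here for illustration.

```python
def distance(movie1, movie2):
    # Same Euclidean distance as in the question.
    squared_difference = 0
    for i in range(len(movie1)):
        squared_difference += (movie1[i] - movie2[i]) ** 2
    return squared_difference ** 0.5


movie_dataset = {
    'Avatar': [0.01940156245995175, 0.4812286689419795, 0.9213483146067416],
    "Pirates of the Caribbean: At World's End": [0.02455894456664483, 0.45051194539249145, 0.898876404494382],
    'Spectre': [0.02005646812429373, 0.378839590443686, 0.9887640449438202],
}
unknown = [0.016, 0.300, 1.022]

distances = []
for title in movie_dataset:
    distances.append([distance(movie_dataset[title], unknown), title])

# Lists compare element by element, so this sorts by distance first.
distances.sort()

titles_in_order = [title for _dist, title in distances]
for pair in distances:
    print(pair)
```

Because each inner list starts with the distance, sorting puts the closest movie first and leaves the title at index 1 of every pair.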
neighbors[1] would just retrieve [0.1946446017955758, "Pirates of the Caribbean: At World's End"], a single element, which doesn't look like what you want. Looping over it would try to use 0.19... and Pirates... as keys in the dict movie_ratings.
I'm guessing you want this instead, to average the ratings of the closest movies, as ranked by the distance values computed from the dataset:
total_rating = 0
for dist, name in neighbors:
    total_rating += movie_ratings[name]
return total_rating / len(neighbors)
Upvotes: 1