Python: Why do I not need 2 variables when unpacking a dictionary?

Question

movie_dataset = {'Avatar': [0.01940156245995175, 0.4812286689419795, 0.9213483146067416], "Pirates of the Caribbean: At World's End": [0.02455894456664483, 0.45051194539249145, 0.898876404494382], 'Spectre': [0.02005646812429373, 0.378839590443686, 0.9887640449438202], ... }

movie_ratings = {'Avatar': 7.9, "Pirates of the Caribbean: At World's End": 7.1, 'Spectre': 6.8, ...}

def distance(movie1, movie2):
  squared_difference = 0
  for i in range(len(movie1)):
    squared_difference += (movie1[i] - movie2[i]) ** 2
  final_distance = squared_difference ** 0.5
  return final_distance

def predict(unknown, dataset, movie_ratings, k):
  distances = []
  #Looping through all points in the dataset
  for title in dataset:
    movie = dataset[title]
    distance_to_point = distance(movie, unknown)
    #Adding the distance and point associated with that distance
    distances.append([distance_to_point, title])
  distances.sort()
  #Taking only the k closest points
  neighbors = distances[0:k]
  total_rating = 0
  for i in neighbors[1]:
    total_rating += movie_ratings[i]  # <----- Why is this an error?
  return total_rating / len(neighbors)  # <----- Why can I not divide by total rating
  #total_rating = 0
  #for i in neighbors:
    # title = neighbors[1]
    #total_rating += movie_ratings[title]  <----- Why is this not an error?
  #return total_rating / len(neighbors)

print(movie_dataset["Life of Pi"])
print(movie_ratings["Life of Pi"])
print(predict([0.016, 0.300, 1.022], movie_dataset, movie_ratings, 5))

Two questions here. First, why is this an error?

for i in neighbors[1]:
    total_rating += movie_ratings[i]

It seems to be the same as

for i in neighbors:
    title = neighbors[1]
    total_rating += movie_ratings[title]

Second, why can I not divide by len(total_rating)?

Karl Knechtel · Accepted Answer

Second question first, because it's more straightforward:

Second, why can I not divide by len(total_rating)?

You're trying to compute an average, right? So you want the sum of the ratings divided by the number of ratings?

Okay. So, you're trying to figure out how many ratings there are. What's the rule that tells you that? It seems like you're expecting to count up the ratings from where they are stored. Where are they stored? Not in total_rating; that's where you stored the numerical sum, and a number has no length, so len(total_rating) raises a TypeError.

Where did the ratings come from? They came from looking up the names of movies in movie_ratings. So the ratings you added up were never stored anywhere as a collection; there is nothing to measure the len of. Right? Well, not quite. What is the rule that determines which ratings we add up? We look them up in movie_ratings by title. So how many of them are there? As many as there are titles. Where were the titles stored? They were paired up with distances in neighbors. So there are as many titles as there are neighbors (whatever "neighbor" is supposed to mean here; I don't really understand why you called it that). So that is what you want the len() of: len(neighbors), which is what the posted return statement already divides by.
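A minimal sketch of that reasoning. The ratings are the ones shown in the question; the distances are made up for illustration, and the two-entry neighbors list is an assumption standing in for whatever predict would build:

```python
# Hypothetical [distance, title] pairs; the distances are invented.
neighbors = [[0.12, 'Avatar'], [0.15, 'Spectre']]
movie_ratings = {'Avatar': 7.9, 'Spectre': 6.8}

total_rating = 0
for neighbor in neighbors:
    total_rating += movie_ratings[neighbor[1]]

# total_rating is a plain number, so asking for its length fails.
try:
    len(total_rating)
    len_error = None
except TypeError as exc:
    len_error = exc

# One rating was added per neighbor, so len(neighbors) is the count.
average = total_rating / len(neighbors)
print(type(len_error).__name__, average)
```

The sum and the count live in different objects: the sum is in total_rating, the count comes from the list that was looped over.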

Onward to fixing the summation.

total_rating = 0
for i in neighbors[1]:
    total_rating += movie_ratings[i]

First, this computes neighbors[1], which will be one of the [distance_to_point, title] pairs that was .appended to the list (assuming there are at least two such values, to make the [1] index valid).

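Then, the loop iterates over that two-element list, so it would run twice: the first time, i is equal to the distance value, and the second time it is equal to the title. It never gets to the second pass, though. On the first pass, movie_ratings[i] looks up a distance (a float) as if it were a movie title, and since no such key exists in movie_ratings, a KeyError is raised.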
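Running that loop on a small sample shows what i holds on each pass and where the loop stops. This is only a sketch: the distances are invented, and the names seen and failed_key are hypothetical helpers for inspecting the result:

```python
# Hypothetical [distance, title] pairs; the distances are invented.
neighbors = [[0.12, 'Avatar'], [0.15, 'Spectre'], [0.20, "Pirates of the Caribbean: At World's End"]]
movie_ratings = {'Avatar': 7.9, "Pirates of the Caribbean: At World's End": 7.1, 'Spectre': 6.8}

# neighbors[1] is a single pair, so iterating it yields its two elements.
seen = [i for i in neighbors[1]]

failed_key = None
try:
    for i in neighbors[1]:
        movie_ratings[i]  # first pass: i is the distance, not a title
except KeyError as exc:
    failed_key = exc.args[0]

print(seen, failed_key)
```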

total_rating = 0
for i in neighbors:
    title = neighbors[1]
    total_rating += movie_ratings[title]

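This loop makes i take on each of the pairs in turn, but the body never uses i. The title = neighbors[1] is the broken part: it always picks the same specific pair, and then uses that pair (which is a list) as a title (we need a string). Dictionary keys have to be hashable and a list is not, so movie_ratings[title] raises a TypeError. If you uncomment this version, it fails as well.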
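A short sketch of what happens when a whole pair is used as the key, again with invented distances. The name message is a hypothetical variable used only to capture the error text:

```python
# Hypothetical [distance, title] pairs; the distances are invented.
neighbors = [[0.12, 'Avatar'], [0.15, 'Spectre']]
movie_ratings = {'Avatar': 7.9, 'Spectre': 6.8}

title = neighbors[1]  # this is the list [0.15, 'Spectre'], not a string

try:
    movie_ratings[title]
    message = None
except TypeError as exc:
    # Lists cannot be dictionary keys; the message mentions "unhashable".
    message = str(exc)

print(message)
```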

What you presumably wanted is:

total_rating = 0
for neighbor in neighbors:
    title = neighbor[1]
    total_rating += movie_ratings[title]

Notice I use a clearer name for the loop variable, to avoid confusion. neighbor is one of the values from the neighbors list, i.e., one of the distance-title pairs. From that, we can get the title, and then from the ratings data and the title, we can get the rating.

I can make it clearer, by using sequence unpacking:

total_rating = 0
for neighbor in neighbors:
    dist, title = neighbor  # "dist", so the distance() function is not shadowed
    total_rating += movie_ratings[title]

Instead of having to understand the reason for a [1] index, now we label each part of the neighbor value, and then use the one that's relevant for our purpose.

I can make it simpler, by doing the unpacking right away:

total_rating = 0
for dist, title in neighbors:  # "dist", so the distance() function is not shadowed
    total_rating += movie_ratings[title]
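One thing to watch when naming the unpacked parts, assuming the loop sits inside predict as in the question: a name bound anywhere in a function is local to the whole function, so a loop variable called distance would hide the distance() helper that the same function calls earlier. A small sketch with hypothetical functions (shadowed and safe are illustrations, not part of the question's code):

```python
def distance(a, b):
    # Stand-in helper: one-dimensional distance.
    return abs(a - b)

def shadowed(points, target):
    pairs = []
    for p in points:
        # Fails: "distance" is a local name in this function (bound by the
        # loop below) and has no value yet at this point.
        pairs.append([distance(p, target), p])
    for distance, p in pairs:
        pass
    return pairs

def safe(points, target):
    pairs = []
    for p in points:
        pairs.append([distance(p, target), p])
    # "_" marks the part of each pair that is not needed.
    return [p for _, p in sorted(pairs)]

try:
    shadowed([5, 1, 3], 2)
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

print(error_name, safe([5, 1, 3], 2))
```

Any name other than distance works for the unused part; a throwaway name such as _ also signals that the value is ignored.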

I can make it more elegant, by not trying to explain to Python how to do sums, and just telling it what to sum:

total_rating = sum(movie_ratings[title] for distance, title in neighbors)

This uses a generator expression along with the built-in sum function, which does exactly what it sounds like.
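Putting the pieces together, here is one way the whole function could look. It is a sketch under some assumptions: only the three movies visible in the question are included (the rest of the dataset is elided there), k is set to 2 to suit that small sample, and the unused distance is named _ in the generator expression:

```python
def distance(movie1, movie2):
    # Euclidean distance between two equal-length feature lists.
    squared_difference = 0
    for a, b in zip(movie1, movie2):
        squared_difference += (a - b) ** 2
    return squared_difference ** 0.5

def predict(unknown, dataset, movie_ratings, k):
    distances = []
    for title in dataset:
        distances.append([distance(dataset[title], unknown), title])
    distances.sort()
    # Keep only the k closest [distance, title] pairs.
    neighbors = distances[:k]
    total_rating = sum(movie_ratings[title] for _, title in neighbors)
    return total_rating / len(neighbors)

# Only the entries shown in the question; the full dataset is larger.
movie_dataset = {
    'Avatar': [0.01940156245995175, 0.4812286689419795, 0.9213483146067416],
    "Pirates of the Caribbean: At World's End": [0.02455894456664483, 0.45051194539249145, 0.898876404494382],
    'Spectre': [0.02005646812429373, 0.378839590443686, 0.9887640449438202],
}
movie_ratings = {'Avatar': 7.9, "Pirates of the Caribbean: At World's End": 7.1, 'Spectre': 6.8}

prediction = predict([0.016, 0.300, 1.022], movie_dataset, movie_ratings, 2)
print(prediction)
```

With this sample the two nearest movies are Spectre and Pirates of the Caribbean: At World's End, so the result is the average of their two ratings.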
