python_newuser

Reputation: 1

For loop only returning last item

# Create random df
df = pd.DataFrame(np.random.randint(1,10, size=(100,23)))
test = df[:50]  

for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    smallest_dist = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]

I am stumped on this problem and wondering if anyone can see where I'm going wrong. I am trying to calculate the Euclidean distance between each row and every other row. I then sort those distances and collect the index positions of the "most similar" rows (those at minimum distance) in the list smallest_dist.

The issue is that this only returns the most similar index positions of the last row: [6.0, 3.0, 4.0]

What I want for output is something like this:

Original ID Matches
1 4,5,6
2 8,2,5

I've tried this but it gives the same result:

list_of_mins = []

for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    smallest_dist = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]
    for i in range(len(test)):
        list_of_mins.append(smallest_dist_ixs)

Does anyone know what's causing this problem? Thank you!

Upvotes: 0

Views: 219

Answers (2)

Phill

Reputation: 194

I don't have the distance library available, so I replaced the distance call with a simple sum; it should work once you swap distance.euclidean back in.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, size=(100, 23)))
test = df[:50]

dict_results = {'ids': [],
                'ids_min': []}

n_min = 2

for i in range(len(test)):
    query_node = test.iloc[i]
    # Placeholder for the distance between this node and everyone else:
    # a plain row sum stands in for distance.euclidean(row, query_node)
    euclidean_distances = test.apply(lambda row: np.sum(row), axis=1)
    # Create a new dataframe with distances.
    # print(euclidean_distances)
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances,
                                        "idx": euclidean_distances.index})

    selected_min = distance_frame.sort_values("dist").head(n_min)
    dict_results['ids'].append(i)
    dict_results['ids_min'].append(', '.join(selected_min['idx'].astype('str')))

print(pd.DataFrame(dict_results))

I added a few changes to your code:

  1. Added an n_min parameter that sets how many entries go in the second column (the number of indices of closest rows).
  2. Created a dict where the results are saved, which is later used to create the data frame you want.
  3. Inside the loop, added an append so that the result of each iteration is saved to that dict.
  4. After the loop, passing the dict to pd.DataFrame parses it the same way you were doing with distance_frame.
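
The steps above can also be sketched with an actual Euclidean distance instead of the row-sum placeholder. This is only a sketch under a few assumptions that are not in the original code: the distance is computed with plain NumPy (so no distance library is needed), a seeded generator is used for repeatability, n_min is set to 3, and each row's own index is skipped so that a row is never listed as its own match.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random frame is repeatable (assumption, not in the original)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 10, size=(100, 23)))
test = df[:50]

n_min = 3  # how many closest rows to keep for each row

values = test.to_numpy(dtype=float)
dict_results = {'ids': [], 'ids_min': []}

for i in range(len(test)):
    # Euclidean distance from row i to every row, including itself
    dists = np.sqrt(((values - values[i]) ** 2).sum(axis=1))
    # Sort positions by distance, drop the query row itself, keep n_min
    order = np.argsort(dists, kind='stable')
    order = order[order != i][:n_min]
    # Store this iteration's result instead of overwriting it
    dict_results['ids'].append(test.index[i])
    dict_results['ids_min'].append(', '.join(test.index[order].astype(str)))

result = pd.DataFrame(dict_results)
print(result.head())
```

The loop keeps the same shape as the code above; only the distance line and the self-match filter differ.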

Upvotes: 1

Joffan

Reputation: 1480

What happens if you try to return the results either in the data frame or (for convenience of testing) in a dictionary? For example:

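import numpy as np
import pandas as pd
from scipy.spatial import distance  # assumption: the question's `distance` is SciPy's

df = pd.DataFrame(np.random.randint(1,10, size=(100,23)))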
test = df[:50]
closest_nodes = {}

for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    closest_nodes[i] = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]

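The thing I didn't see in your code was some sort of storage action that puts the one result per test case into a permanent structure. Because smallest_dist is reassigned on every pass through the loop, only the last row's result is left once the loop finishes.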
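
To show the difference in isolation, here is a tiny standard-library sketch. The four 2-D points are made up for illustration, and math.dist stands in for the distance function; neither comes from the question. The first loop rebinds a single name on each pass, the second stores each pass's result under its own key.

```python
import math

# Made-up 2-D points, purely for illustration
rows = [(0, 0), (1, 0), (5, 5), (6, 5)]

# Pattern 1: `nearest` is rebound on every pass, so only the last result survives
for i, query in enumerate(rows):
    others = [j for j in range(len(rows)) if j != i]
    nearest = min(others, key=lambda j: math.dist(rows[j], query))

# Pattern 2: each pass stores its result under its own key
closest = {}
for i, query in enumerate(rows):
    others = [j for j in range(len(rows)) if j != i]
    closest[i] = min(others, key=lambda j: math.dist(rows[j], query))

print(nearest)  # result for the last row only
print(closest)  # one entry per row
```

After both loops run, nearest holds a single index while closest has one entry for every row, which is the shape the question asks for.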

Upvotes: 0
