Reputation: 1
import numpy as np
import pandas as pd
from scipy.spatial import distance  # assumed: "distance" is scipy.spatial.distance

# Create random df
df = pd.DataFrame(np.random.randint(1, 10, size=(100, 23)))
test = df[:50]
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    smallest_dist = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]
I am stumped on this problem and wondering if anyone can see where I'm going wrong. I am trying to calculate the Euclidean distance between each row and every other row. Then I sort those distances and store the index positions of the three "most similar" rows (those with the smallest distances) in the list smallest_dist.
The issue is that this only returns the most similar index positions for the last row: [6.0, 3.0, 4.0]
What I want for output is something like this:
Original ID | Matches |
---|---|
1 | 4,5,6 |
2 | 8,2,5 |
I've tried this but it gives the same result:
list_of_mins = []
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    smallest_dist = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]
# This second loop runs after the first has finished, so it appends the last row's result each time
for i in range(len(test)):
    list_of_mins.append(smallest_dist)
Does anyone know what's causing this problem? Thank you!
Upvotes: 0
Views: 219
Reputation: 194
I don't have the distance library available, so I replaced that call with a simple sum, but the code should work once you swap distance.euclidean back in:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, size=(100, 23)))
test = df[:50]
dict_results = {'ids': [],
                'ids_min': []}
n_min = 2
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    # Placeholder: np.sum ignores query_node; put distance.euclidean(row, query_node) back here
    euclidean_distances = test.apply(lambda row: np.sum(row), axis=1)
    # Create a new dataframe with distances.
    # print(euclidean_distances)
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances,
                                        "idx": euclidean_distances.index})
    selected_min = distance_frame.sort_values("dist").head(n_min)
    # Store one result per query row instead of overwriting it
    dict_results['ids'].append(i)
    dict_results['ids_min'].append(', '.join(selected_min['idx'].astype('str')))
print(pd.DataFrame(dict_results))
I made a few changes to your code:
n_min: a parameter that defines how many elements you want in the second column (the number of indices of the closest rows)
distance_frame
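If SciPy is not installed, the sum placeholder in the loop could be swapped for a NumPy-only Euclidean distance. This is a minimal sketch, assuming every column is numeric and the index labels are unique; the helper name closest_rows is hypothetical and not part of the code above:

```python
import numpy as np
import pandas as pd


def closest_rows(frame, position, n_min=3):
    """Return the index labels of the n_min rows nearest to the row at `position`."""
    query_node = frame.iloc[position]
    # Row-wise Euclidean distance: square root of the summed squared differences.
    dists = np.sqrt(((frame - query_node) ** 2).sum(axis=1))
    # Drop the query row itself (distance 0) before picking the smallest values.
    dists = dists.drop(frame.index[position])
    return dists.nsmallest(n_min).index.tolist()


rng = np.random.default_rng(0)
test = pd.DataFrame(rng.integers(1, 10, size=(50, 23)))

# One stored entry per query row, keyed by its position.
results = {i: closest_rows(test, i) for i in range(len(test))}
```

Dropping the query row first avoids the case where each row reports itself as its own nearest match.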
Upvotes: 1
Reputation: 1480
What happens if you try to store the results either in a data frame or (for convenience of testing) in a dictionary? For example:
import numpy as np
import pandas as pd
from scipy.spatial import distance  # assumed: "distance" is scipy.spatial.distance

df = pd.DataFrame(np.random.randint(1, 10, size=(100, 23)))
test = df[:50]
closest_nodes = {}
for i in range(len(test)):
    query_node = test.iloc[i]
    # Find the distance between this node and everyone else
    euclidean_distances = test.apply(lambda row: distance.euclidean(row, query_node), axis=1)
    # Create a new dataframe with distances.
    distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
    distance_frame.sort_values("dist", inplace=True)
    # Store this row's three closest matches under its own key
    closest_nodes[i] = [dist["idx"] for idx, dist in distance_frame.iloc[1:4].iterrows()]
What I didn't see in your code was any step that stores the one result per test case in a permanent structure, so each iteration overwrites smallest_dist and only the last row's matches survive.
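The same idea, one stored result per query row, can also be sketched without an explicit loop. This assumes SciPy is available and uses scipy.spatial.distance.cdist, which is a swapped-in technique and not what the original code does; the column names original_id and matches are hypothetical:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
test = pd.DataFrame(rng.integers(1, 10, size=(50, 23)))

# Pairwise Euclidean distances between all rows: shape (50, 50).
values = test.to_numpy()
dist_matrix = cdist(values, values, metric="euclidean")
# Exclude each row from its own list of matches.
np.fill_diagonal(dist_matrix, np.inf)
# Positions of the three smallest distances in each row, nearest first.
nearest = np.argsort(dist_matrix, axis=1)[:, :3]

# Assumed layout: one row per query, with its matches kept as a list of index labels.
matches = pd.DataFrame({
    "original_id": test.index,
    "matches": [test.index[row].tolist() for row in nearest],
})
print(matches.head())
```

Computing the full distance matrix once keeps every row's result, so nothing is overwritten between iterations.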
Upvotes: 0