Ace Knight
Ace Knight

Reputation: 11

How to compute multiple euclidean distances of all points in a dataset?

I'm new to python, I have a dataset with hundreds of entries and i want to find the euclidean distance of 6th nearest neighbour for each point and save them.

the entries are like this:

362.240997 242.054993
505.821014 159.210007
420.803986 134.830002
504.035004 314.125000
356.670013 199.093994
326.545990 91.766998
214.477005 63.821999
351.351013 86.885002
216.041000 242.024994
441.700012 277.333008
68.678001 203.095001
547.051025 99.218002
405.983002 141.934006
402.239990 247.876007
197.134003 260.622009
163.141006 66.302002
561.950989 172.966995
340.036987 115.315002
63.076000 78.059998
261.072998 268.122009
319.376007 65.832001
.......

I have no idea where to start, I tried to look around but didn't understand anything because this is too specific. Any help is appreciated.

THANK YOU EVERYONE SO MUCH!

Upvotes: 1

Views: 5284

Answers (4)

Ouss
Ouss

Reputation: 3885

What you are trying to do is to create what is called in the business the pairwise distance matrix.

you can use scipy.spatial.distance.pdist function to easily achieve that, and using the scipy.spatial.distance.squareform will make the output easily readable.

from scipy.spatial.distance import pdist, squareform
import pandas as pd
#load the dataset in a panda DataFrame
df_dataset=pd.DataFrame(dataset)
# use the pdist() function to calculate the 
# Eucledian Distance between all pairs of rows in the dataframe
# and then pass the distances to the squareform() function that prints 
# out the result in a square format with rows and columns 
# corresponding to the points (row indexes of the original dataset).
squareform(pdist(df_dataset),columns=df_dataset.index,index=df_dataset.index)

And that's it

Upvotes: 0

vlemaistre
vlemaistre

Reputation: 3331

Here is an other way to do it only using python. I just use pandas to import the data. So, first of all create a csv out of your data :

import pandas

# Read your csv :
df = pd.read_csv('your_file.csv')

# Consider your points as tuples in a list
data = [(float(x),float(y)) for x, y in df[['x', 'y']].values ]

nearest_points = []
for point in data:
    # Compute the distance between the current point and all others
    distances = [math.sqrt((point[0]-x[0] )**2+ (point[1]-x[1])**2) for x in data]
    # Use np.argsort() to sort the array and keep the three closest points
    nearest_points.append([data[i] for i in np.argsort(distances)[1:4]])

Upvotes: 1

hashcode55
hashcode55

Reputation: 5870

Here is one simple way to achieve what you want using the sklearn.

>>> from sklearn.neighbors import NearestNeighbors
>>> import numpy as np
>>> values = [[1, 2], [2, 3], [4.5, 2.5], [1.5, 3], [5, 2], [8, 9], [10, 10]]
>>> nbrs = NearestNeighbors(n_neighbors=6, algorithm='ball_tree', metric='euclidean').fit()
>>> distances, indices = nbrs.kneighbors(values)
>>> distances[0]
array([0.        , 1.11803399, 1.41421356, 3.53553391, 4.        ,
       9.89949494])

distances[0] contains the euclidean distance from your 6 nearest neighbors to the first data point which is (1, 2). You can simply extract the last values from the complete result.

For more info, please refer sklearn documentation.

Edit To get the distances from the sixth neighbor for all data points:

>>> sixth_nnd = [d[5] for d in distances]
>>> sixth_nnd
[9.899494936611665, 8.48528137423857, 7.3824115301167, 8.845903006477066, 7.615773105863909, 8.845903006477066, 11.01135777277262]

You just need to save sixth_nnd in file.

Upvotes: 0

Aimery
Aimery

Reputation: 1793

First of all, you should read your input from the file and store every point in a list. Note your file may be considered a csv file using spaces instead of commas as separator. See the documentation for reading csv files in Python.

Next, I would suggest, if there aren't too many points, to compute the Euclidean distance between any two points and storing it in a 2D list, such that dist[i][j] contains the distance between point i and j. With n points, time complexity will be O(n²). You may optimize this step by computing only half the distances (since dist[i][j] and dist[j][i] are the same).

Then for each point, find the 6 nearest by looping over either a column or line of your distance list (remember, it is symmetrical) to find the smallest distances. That is: for a fixed value of i, find the six values of j that yield the smallest values of dist[i][j]. Or alternatively: for a fixed value of j, find the six values of i that yield the smallest values of dist[i][j].

Upvotes: 0

Related Questions