Reputation: 163
In a dataframe df containing points (rows) and coordinates (columns), I want to compute, for each point, the n closest neighboring points and the corresponding distances.
I did something like this:
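import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 6))

def dist(p, q):
    # squared Euclidean distance from point p to every row of q (no square root)
    return ((p - q)**2).sum(axis=1)

def f(s):
    closest = dist(s, df).nsmallest(3)
    return list(closest.index) + list(closest)

df.apply(f, axis=1, result_type="expand")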
which gives:
     0    1    2    3         4         5
0  0.0  3.0  2.0  0.0  0.743722  1.140251
1  1.0  2.0  0.0  0.0  1.548676  1.695104
2  2.0  3.0  0.0  0.0  0.702797  1.140251
3  3.0  2.0  0.0  0.0  0.702797  0.743722
(first 3 columns are the indices of the closest points, the next 3 columns are the corresponding distances)
However, I would prefer to get a dataframe with 3 columns: point, closest point to it, distance between them. Put another way: I want one row per (point, neighbor) distance, not one row per point.
I tried pd.melt and pd.pivot, but could not find a good way to do it.
Upvotes: 0
Views: 124
Reputation: 4648
To find the k nearest neighbors (kNN) of each point, you can use sklearn.neighbors.NearestNeighbors.
Data
import numpy as np
import pandas as pd
np.random.seed(52) # reproducibility
df = pd.DataFrame(np.random.rand(4, 6))
print(df)
          0         1         2         3         4         5
0  0.823110  0.026118  0.210771  0.618422  0.098284  0.620131
1  0.053890  0.960654  0.980429  0.521128  0.636553  0.764757
2  0.764955  0.417686  0.768805  0.423202  0.926104  0.681926
3  0.368456  0.858910  0.380496  0.094954  0.324891  0.415112
Code
from sklearn.neighbors import NearestNeighbors
k = 3
dist, indices = NearestNeighbors(n_neighbors=k).fit(df).kneighbors(df)
Result
print(dist)
array([[0.00000000e+00, 1.09330867e+00, 1.13862254e+00],
       [0.00000000e+00, 9.32862532e-01, 9.72369661e-01],
       [0.00000000e+00, 9.72369661e-01, 1.02130721e+00],
       [2.10734243e-08, 9.32862532e-01, 1.02130721e+00]])
print(indices)
array([[0, 2, 3],
       [1, 3, 2],
       [2, 1, 3],
       [3, 1, 2]])
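The resulting distances and indices can easily be rearranged into the desired shape. Note that the first neighbor returned for each point is the point itself (distance 0, up to floating-point error).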
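As a minimal sketch of one such rearrangement, the two arrays can be flattened into a long dataframe with one row per (point, neighbor) pair. The arrays are copied from the output above so that the snippet runs without scikit-learn; the names long_df and neighbor are illustrative only, and the point column assumes the default 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Values copied from the kneighbors() output shown above (k = 3).
dist = np.array([
    [0.0, 1.09330867, 1.13862254],
    [0.0, 0.932862532, 0.972369661],
    [0.0, 0.972369661, 1.02130721],
    [2.10734243e-08, 0.932862532, 1.02130721],
])
indices = np.array([
    [0, 2, 3],
    [1, 3, 2],
    [2, 1, 3],
    [3, 1, 2],
])

n_points, k = indices.shape

# One row per (point, neighbor) pair: repeat each point k times and
# flatten the neighbor indices and distances row by row.
long_df = pd.DataFrame({
    "point": np.repeat(np.arange(n_points), k),
    "neighbor": indices.ravel(),
    "distance": dist.ravel(),
})

# Drop the rows where a point is matched with itself.
long_df = long_df[long_df["point"] != long_df["neighbor"]].reset_index(drop=True)
print(long_df)
```

If the dataframe has a non-default index, the positional values in point and neighbor would need to be mapped back through df.index.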
If you only need the single closest point, sklearn.metrics has a built-in Euclidean distance function, euclidean_distances, which outputs an array of shape [#rows x #rows]. You can exclude the diagonal elements (the distance of each point to itself, namely 0) from min() and argmin() by filling the diagonal with infinity.
Code
from sklearn.metrics import euclidean_distances
dist = euclidean_distances(df.values, df.values)
np.fill_diagonal(dist, np.inf) # exclude self from min()
df_want = pd.DataFrame({
    "point": range(df.shape[0]),
    "closest_point": dist.argmin(axis=1),
    "distance": dist.min(axis=1)
})
Result
print(df_want)
   point  closest_point  distance
0      0              2  1.093309
1      1              3  0.932863
2      2              1  0.972370
3      3              1  0.932863
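The same fill-the-diagonal idea can probably be extended from the single closest point to the k closest ones by sorting each row of the distance matrix. The following is only a sketch under assumptions: it uses NumPy broadcasting in place of euclidean_distances, an arbitrary seed, and hypothetical names (k, order, df_knn) that are not part of the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
df = pd.DataFrame(rng.random((4, 6)))
k = 2  # number of neighbors to keep per point (hypothetical choice)

X = df.to_numpy()

# Pairwise Euclidean distances via broadcasting, shape (n, n).
dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors

# Column positions of the k smallest distances in each row, nearest first.
order = np.argsort(dist, axis=1)[:, :k]

labels = df.index.to_numpy()
df_knn = pd.DataFrame({
    "point": np.repeat(labels, k),
    "closest_point": labels[order.ravel()],
    "distance": np.take_along_axis(dist, order, axis=1).ravel(),
})
print(df_knn)
```

A full argsort costs O(n log n) per row; for large n, np.argpartition or the NearestNeighbors class shown earlier is likely a better fit.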
Upvotes: 1