Quentin Fortier

Reputation: 163

Reshape the structure of a dataframe

In a dataframe df whose rows are points and whose columns are coordinates, I want to compute, for each point, its n nearest neighbors and the corresponding distances.

I did something like this:

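import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 6))

def dist(p, q):
    # squared Euclidean distance from point p to every row of q
    return ((p - q)**2).sum(axis=1)

def f(s):
    # 3 smallest distances; the point itself (distance 0) is included
    closest = dist(s, df).nsmallest(3)
    return list(closest.index) + list(closest)

df.apply(f, axis=1, result_type="expand")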

which gives:

     0    1    2    3         4         5
0  0.0  3.0  2.0  0.0  0.743722  1.140251
1  1.0  2.0  0.0  0.0  1.548676  1.695104
2  2.0  3.0  0.0  0.0  0.702797  1.140251
3  3.0  2.0  0.0  0.0  0.702797  0.743722

(first 3 columns are the indices of the closest points, the next 3 columns are the corresponding distances)

However, I would prefer to get a dataframe with 3 columns: point, closest point to it, distance between them. Put another way: I want one row per (point, neighbor) pair, not one row per point.

I tried pd.melt and pd.pivot but could not find a good way to do it.

Upvotes: 0

Views: 124

Answers (1)

Bill Huang

Reputation: 4648

Option 1: Scikit-learn NearestNeighbors class

sklearn.neighbors.NearestNeighbors finds the k nearest neighbors (kNN) of each point.

Data

import numpy as np
import pandas as pd

np.random.seed(52)  # reproducibility
df = pd.DataFrame(np.random.rand(4, 6))

print(df)
          0         1         2         3         4         5
0  0.823110  0.026118  0.210771  0.618422  0.098284  0.620131
1  0.053890  0.960654  0.980429  0.521128  0.636553  0.764757
2  0.764955  0.417686  0.768805  0.423202  0.926104  0.681926
3  0.368456  0.858910  0.380496  0.094954  0.324891  0.415112

Code

from sklearn.neighbors import NearestNeighbors

k = 3
dist, indices = NearestNeighbors(n_neighbors=k).fit(df).kneighbors(df)

Result

print(dist)
array([[0.00000000e+00, 1.09330867e+00, 1.13862254e+00],
       [0.00000000e+00, 9.32862532e-01, 9.72369661e-01],
       [0.00000000e+00, 9.72369661e-01, 1.02130721e+00],
       [2.10734243e-08, 9.32862532e-01, 1.02130721e+00]])

print(indices)
array([[0, 2, 3],
       [1, 3, 2],
       [2, 1, 3],
       [3, 1, 2]])

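dist and indices are aligned arrays of shape (n_points, k): row i lists the k nearest neighbors of point i in order of increasing distance. Because the query points are the fitted points, each point appears here as its own first neighbor, with distance 0 up to floating-point error (hence the 2.1e-08 in the last row). The two arrays can be flattened into the requested three-column layout.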
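One way to do that rearrangement is sketched below. To keep the snippet runnable without scikit-learn, the dist and indices arrays are copied from the output printed above; the names long_df and neighbor are illustrative choices, not part of the original answer. The sketch also assumes df has a default 0..n-1 index, so that row positions and point labels coincide.

```python
import numpy as np
import pandas as pd

# Values copied from the kneighbors() output shown above (k = 3).
dist = np.array([
    [0.00000000e+00, 1.09330867e+00, 1.13862254e+00],
    [0.00000000e+00, 9.32862532e-01, 9.72369661e-01],
    [0.00000000e+00, 9.72369661e-01, 1.02130721e+00],
    [2.10734243e-08, 9.32862532e-01, 1.02130721e+00],
])
indices = np.array([
    [0, 2, 3],
    [1, 3, 2],
    [2, 1, 3],
    [3, 1, 2],
])

n_points, k = indices.shape

# One row per (point, neighbor) pair: repeat each point label k times
# and flatten both arrays row by row so the three columns stay aligned.
long_df = pd.DataFrame({
    "point": np.repeat(np.arange(n_points), k),
    "neighbor": indices.ravel(),
    "distance": dist.ravel(),
})

# Drop the rows where a point is matched with itself.
long_df = long_df[long_df["point"] != long_df["neighbor"]].reset_index(drop=True)

print(long_df)
```

Filtering on point != neighbor rather than slicing off the first column is a deliberate choice: with duplicated points, the point itself is not guaranteed to come first.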

Option 2: compute manually (nearest except self)

sklearn.metrics provides euclidean_distances, which returns the pairwise distance matrix as an array of shape (n_rows, n_rows). To exclude each point's distance to itself (the zeros on the diagonal) from min() and argmin(), fill the diagonal with infinity.

Code

from sklearn.metrics import euclidean_distances

dist = euclidean_distances(df.values, df.values)
np.fill_diagonal(dist, np.inf)  # exclude self from min()

df_want = pd.DataFrame({
    "point": range(df.shape[0]),
    "closest_point": dist.argmin(axis=1),
    "distance": dist.min(axis=1)    
})

Result

print(df_want)
   point  closest_point  distance
0      0              2  1.093309
1      1              3  0.932863
2      2              1  0.972370
3      3              1  0.932863
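Option 2 returns only the single nearest point. A minimal sketch of how the same idea might extend to the n nearest neighbors follows, using plain NumPy broadcasting instead of scikit-learn. The names points, n, order and result are illustrative, and the data is freshly generated rather than the df above. The intermediate array has shape (n_points, n_points, n_coords), so this is only reasonable for small inputs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
points = pd.DataFrame(rng.random((4, 6)))  # 4 points, 6 coordinates
n = 2                                      # neighbors to keep per point

X = points.to_numpy()

# Pairwise Euclidean distances via broadcasting: dist[i, j] = ||X[i] - X[j]||.
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors

# Column positions of the n smallest distances in each row, nearest first.
order = np.argsort(dist, axis=1)[:, :n]
nearest = np.take_along_axis(dist, order, axis=1)

labels = points.index.to_numpy()
result = pd.DataFrame({
    "point": np.repeat(labels, n),
    "closest_point": labels[order.ravel()],
    "distance": nearest.ravel(),
})

print(result)
```

Mapping positions back through points.index keeps the output correct when the dataframe has a non-default index.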

Upvotes: 1
