yangze zhao

Reputation: 109

Confused by the output of sklearn.neighbors.NearestNeighbors

Here is the code.

from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)


>>> indices
array([[0, 1],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4]])

>>> distances
array([[0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.41421356],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.41421356]])

I don't really understand the shape of 'indices' and 'distances'. How should I interpret these numbers?

Upvotes: 7

Views: 6741

Answers (3)

athina.bikaki

Reputation: 809

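To add to the other answers, here is how to collect the n_neighbors=2 neighbors of every point into a pandas DataFrame using the indices array. X in the question is a NumPy array, so it is indexed directly with X[...]; X.iloc[...] would apply only if X were itself a DataFrame.

import pandas as pd

# One row per (query point, neighbor) pair, in row-major order of indices
df = pd.DataFrame([X[indices[row, col]] for row in range(indices.shape[0]) for col in range(indices.shape[1])])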
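A flat frame of neighbor coordinates does not record which query point each row belongs to. One possible variant is sketched below, assuming X is the NumPy array from the question. The column names (query, rank, neighbor, distance, x, y) are hypothetical labels chosen for illustration, not anything scikit-learn or pandas produces on its own.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(X)
distances, indices = nbrs.kneighbors(X)

n_queries, k = indices.shape

# Long-format table: one row per (query point, neighbor rank) pair.
table = pd.DataFrame({
    "query": np.repeat(np.arange(n_queries), k),   # row of indices/distances
    "rank": np.tile(np.arange(k), n_queries),      # column of indices/distances
    "neighbor": indices.ravel(),                   # index into the fitted X
    "distance": distances.ravel(),
})

# Look up the neighbor coordinates in the fitted data.
neighbor_idx = table["neighbor"].to_numpy()
table["x"] = X[neighbor_idx, 0]
table["y"] = X[neighbor_idx, 1]

print(table)
```

Each query point contributes k consecutive rows, so filtering on table["query"] == 0 returns the two neighbors of X[0].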

Upvotes: 1

caverac

Reputation: 1637

Maybe a little sketch will help

[image: sketch]

As an example, the closest other point to the training sample with index 0 is the sample with index 1. Since you are using n_neighbors = 2 (two neighbors, the first of which is the point itself), you would expect to see this pair in the results. And indeed the pair [0, 1] appears as the first row of the output.
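The pairing can also be checked numerically. The following is a minimal brute-force sketch using only the standard library (it needs Python 3.8+ for math.dist). It is not the ball tree that scikit-learn uses in the question, and brute_force_kneighbors is a hypothetical helper name introduced for illustration.

```python
import math

points = [(-1, -1), (-2, -1), (-3, -2), (1, 1), (2, 1), (3, 2)]


def brute_force_kneighbors(points, k):
    """Return (distances, indices) of the k closest points for every point."""
    all_distances, all_indices = [], []
    for query in points:
        # Rank every point (including the query itself) by Euclidean distance.
        ranked = sorted((math.dist(query, p), j) for j, p in enumerate(points))
        all_distances.append([d for d, _ in ranked[:k]])
        all_indices.append([j for _, j in ranked[:k]])
    return all_distances, all_indices


distances, indices = brute_force_kneighbors(points, k=2)
print(indices)    # row i lists the indices of the 2 points closest to points[i]
print(distances)  # row i lists the matching distances
```

For this data set the rows match the indices array in the question: every point is its own nearest neighbor at distance 0, followed by the closest other point.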

Upvotes: 4

Vivek Kumar

Reputation: 36599

It's pretty straightforward, actually. For each data sample passed to kneighbors() (X here), it returns 2 neighbors, because you specified n_neighbors=2. indices gives you the index of each neighbor in the training data (again X here), and distances gives you the distance from the query point to that neighbor (the training point that the corresponding entry of indices refers to).

Take a single data point as an example. With X[0] as the first query point, the answer is given by indices[0] and distances[0].

So for X[0],

  • the index of the first nearest neighbor in the training data is indices[0, 0] = 0 and its distance is distances[0, 0] = 0. You can use this index value to get the actual data sample from the training data.

    This makes sense, because you used the same data for training and testing, so the first nearest neighbor for each point is itself and the distance is 0.

  • the index of the second nearest neighbor is indices[0, 1] = 1 and its distance is distances[0, 1] = 1

The same holds for all other points. The first dimension of indices and distances corresponds to the query points, and the second dimension to the number of neighbors requested.
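To make the two dimensions concrete, here is a short sketch that reuses the data from the question. The query point [0.5, 0.5] is an arbitrary value chosen for illustration. The last call relies on the documented behavior that kneighbors() without an argument queries the fitted points and does not count a point as its own neighbor.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(X)
distances, indices = nbrs.kneighbors(X)

# One row per query point, one column per requested neighbor.
print(indices.shape, distances.shape)

# Use the indices to fetch the actual neighbor samples from the fitted data.
neighbor_points = X[indices]   # shape: (n_queries, n_neighbors, n_features)
print(neighbor_points[0])      # the two neighbors of X[0]

# Query with a point that is not in the fitted data: no zero-distance match.
query = np.array([[0.5, 0.5]])
new_distances, new_indices = nbrs.kneighbors(query)
print(new_indices, new_distances)

# Query the fitted points themselves, excluding each point from its own result.
other_distances, other_indices = nbrs.kneighbors()
print(other_indices[0], other_distances[0])
```

When the query data differs from the fitted data, the number of rows follows the query array while the indices still refer to rows of the fitted X.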

Upvotes: 10
