random_dsp_guy

Reputation: 246

Computing a nearest-neighbor graph using sklearn?

This question is about creating a k-nearest-neighbor graph (KNNG) from a dataset with an unknown number of clusters (which is not the same problem as k-means clustering).

Suppose that you have a dataset of observations stored in a data matrix X of shape (n_samples, n_features), with each row being an observation (feature vector) and each column being a feature. Now suppose you want to compute the (weighted) k-neighbors graph for the points in X using sklearn.neighbors.kneighbors_graph.
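For concreteness, here is a minimal sketch of the kneighbors_graph call on a small made-up dataset (the values of X below are invented for illustration):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# toy data: 5 observations, 2 features, forming two obvious groups
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1]])

# connectivity graph: A[i, j] = 1 if j is among the 2 nearest neighbors of i
A = kneighbors_graph(X, n_neighbors=2, mode='connectivity')

# weighted graph: entries hold the euclidean distance to each neighbor
W = kneighbors_graph(X, n_neighbors=2, mode='distance')

print(A.toarray())
print(W.toarray())
```

Both calls return a sparse matrix of shape (n_samples, n_samples); by default a point is not counted as its own neighbor (include_self=False).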

What are the basic methods to pick the number of neighbors to use for each sample? What algorithms scale well when you have lots of observations?

I have seen the brute-force method below, but it doesn't scale well as the dataset gets large, and you have to pick a good starting upper bound for n_neighbors_max. Does this algorithm have a name?

import numpy
import sklearn.metrics

def autoselect_K(X, n_neighbors_max, threshold):
    # pairwise euclidean distance between every pair of observations
    D = sklearn.metrics.pairwise.euclidean_distances(X, X)
    chosen_k = n_neighbors_max
    for k in range(2, n_neighbors_max):
        k_avg = []
        # loop over each row of the distance matrix
        for row in D:
            # sort the row from smallest distance to largest distance
            sorted_row = numpy.sort(row)
            # mean of the k nearest-neighbor distances, skipping
            # index 0 (the zero self-distance)
            k_avg.append(numpy.mean(sorted_row[1:k + 1]))
        # median of the per-sample averages
        kmedian_dist = numpy.median(k_avg)
        if kmedian_dist >= threshold:
            chosen_k = k
            break
    # return the number of nearest neighbors to use
    return chosen_k

Upvotes: 1

Views: 1780

Answers (1)

navneeth

Reputation: 1317

  • From your code, it appears that you are looking for a classification result based on the nearest neighbour. In that case, your search over the full distance matrix is a brute-force search, which defeats the purpose of nearest-neighbour algorithms.

Perhaps what you are looking for is KNeighborsClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
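For reference, a minimal sketch of that classifier on toy data (the training points and labels below are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy 1-D training data with two invented classes
X_train = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y_train = np.array([0, 0, 0, 1, 1])

# classify a query point by majority vote among its 3 nearest neighbors
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

print(clf.predict([[0.15]]))  # query near the points labeled 0
print(clf.predict([[4.9]]))   # query near the points labeled 1
```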

  • Regarding the choice of the number of nearest neighbours, this depends on the sparsity of your data. It helps to view nearest-neighbour search as a way to bound your search: rather than looking over all samples, it narrows the search to the top-N (nearest-neighbour) samples. Afterwards you can apply a domain-specific technique to those N samples to get the desired result.
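One way to sketch that bounded search with sklearn.neighbors.NearestNeighbors (the data here is random and only for illustration): build the index once, then query only the top-N neighbours per sample instead of scanning the whole dataset.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))  # 1000 random 3-D observations

# build an index once (kd-tree / ball tree chosen automatically),
# then each query touches only the top-5 neighbors, not all samples
nn = NearestNeighbors(n_neighbors=5).fit(X)
dist, idx = nn.kneighbors(X[:1])  # neighbors of the first sample

# the nearest neighbor of a point in its own index is itself
print(idx[0][0])
```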

Upvotes: 2
