Reputation: 4173
I'm working with a medium-size dataset (shape=(14013L, 46L)).
I want to smooth each sample with its k nearest neighbours.
I'm training my model with:

nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='ball_tree',
                        metric=sklearn.metrics.pairwise.cosine_distances).fit(data)
And the smoothing is as follows:
def smooth(x, nbrs, data, alpha):
    """
    input:
        alpha: the smoothing factor
        nbrs: trained NearestNeighbors from sklearn
        data: the original data
              (since NearestNeighbors returns only the index and not the samples)
        x: what we want to smooth
    output:
        smoothed x with its nearest neighbours
    """
    distances, indices = nbrs.kneighbors(x)
    distances = map(lambda z: abs(-z + 1), distances)[0]
    norm = sum(distances)
    if norm == 0:
        "No neighbours were found."
        return x
    distances = map(lambda z: (1 - alpha) * z / norm, distances)
    indices = map(lambda z: data[z], indices)[0]
    other = np.array([indices[i] * distances[i] for i in range(len(distances))])
    z = x * alpha
    z = z.reshape((1, z.shape[0]))
    smoothed = sum(np.concatenate((other, z), axis=0))
    return smoothed
My questions: is my implementation correct (some samples end up in the norm == 0 if condition), and why is smoothing so much slower than training?
Upvotes: 1
Views: 387
Reputation: 2487
Without your dataset and complete code, it is difficult to say for certain. Here is what I think.
distances = map(lambda z: abs(-z + 1), distances)[0]
norm = sum(distances)
Because you index the result of the map, you are working with only the first row of results. And if x is actually one of the points you trained on, then its first nearest neighbor will be... x itself. Because you are using the cosine distance, that distance is exactly 0, so abs(1 - 0) == 1 and x gets the largest weight in its own smoothing.
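You can check this directly; a quick sanity check, assuming nbrs has been fit on data:

# Query with a sample that is part of the training set
distances, indices = nbrs.kneighbors(data[0].reshape(1, -1))

# The nearest neighbor of a training point is the point itself:
# indices[0][0] == 0 and distances[0][0] == 0.0 under cosine distance,
# so the abs(1 - z) transform gives that neighbor weight 1
print(indices[0][0], distances[0][0])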
Before I suggest a correction, let's talk about performance.
With regard to performance: you are using the map function everywhere, which is a Python builtin. Scikit-learn is built on numpy, which is designed to do math much faster than native Python code; that is why training is so much faster than your code. You should be using numpy arithmetic instead of map. One example: this line
distances = map(lambda z: (1 - alpha) * z / norm, distances)
should be this
distances *= ((1-alpha)/norm)
If norm is an array, it should have the right dimensions for numpy's broadcasting rules to kick in and do "the right thing", entirely in C.
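To see the equivalence concretely (illustrative numbers, not from your data):

import numpy as np

distances = np.array([0.2, 0.5, 0.3])
alpha = 0.7
norm = distances.sum()

# map-based version: a Python-level loop
mapped = list(map(lambda z: (1 - alpha) * z / norm, distances))

# numpy version: one C-level operation over the whole array
distances *= (1 - alpha) / norm

print(np.allclose(mapped, distances))   # True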
Okay, so since I'm suggesting numpy arrays (instead of map and Python lists), I believe the correct thing for the two lines before your if statement is to drop the indexing. (The rest of your code is probably broken by the indexing as well; after the second line of your function, distances is not an array or a list, but a scalar.)
distances = np.abs(distances - 1)
norm = np.sum(distances)
Additionally, you should not call your smooth() function lots of times, once per sample. You should call it once on an N_samples by N_dimensions == 46 numpy array and adjust your smooth() code accordingly. (NearestNeighbors, for example, will return one N_samples by N_neighbors array much more quickly than N_samples individual arrays of length N_neighbors.)
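Putting those suggestions together, here is a minimal sketch of a fully vectorized version (smooth_all is a hypothetical name, and it assumes data is a plain numpy array that nbrs was fit on):

import numpy as np

def smooth_all(X, nbrs, data, alpha):
    # One batched query: both arrays have shape (n_samples, n_neighbors)
    distances, indices = nbrs.kneighbors(X)

    # Same weight transform as the original code, on the whole array at once
    weights = np.abs(1.0 - distances)

    # Per-sample normalisation; remember which rows have no weight at all
    norms = weights.sum(axis=1, keepdims=True)
    empty = norms[:, 0] == 0            # mirrors the original norm == 0 branch
    norms[empty] = 1.0                  # avoid dividing by zero
    weights = (1.0 - alpha) * weights / norms

    # data[indices] has shape (n_samples, n_neighbors, n_dims); broadcasting
    # the weights over the last axis gives each sample's weighted neighbor sum
    neighbor_part = (data[indices] * weights[:, :, None]).sum(axis=1)

    smoothed = alpha * X + neighbor_part
    smoothed[empty] = X[empty]          # rows without neighbors stay unchanged
    return smoothed

Called as smooth_all(data, nbrs, data, alpha), this does one kneighbors query and a handful of array operations instead of 14013 separate calls.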
Upvotes: 1
Reputation: 73414
First of all, why use a ball tree? Maybe your metric implies it, but if that's not the case, you could use a kd-tree too.
I will approach your question from a theoretical point of view. The radius parameter (used by radius_neighbors queries) defaults to 1.0. This might be too small for your data set, since, if I understand correctly, it specifies how far from the query point to look. So I would suggest running your code with increasing values of radius until you get some results, then decreasing it until you get no results; increase it a bit again and you have the optimal parameter for your data set.
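For example, a coarse doubling search along those lines might look like this (a hypothetical sketch; x is one query row, and the radius parameter is what radius_neighbors uses):

radius = 1.0
while True:
    distances, indices = nbrs.radius_neighbors(x.reshape(1, -1), radius=radius)
    if len(indices[0]) > 0:     # at least one neighbour found
        break
    radius *= 2.0               # widen the search and try again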
Another important parameter is leaf_size, which controls how many points are checked by brute force once a query descends to a leaf of the tree. It affects the speed and memory use of both tree construction and queries (in scikit-learn the results themselves stay exact), so it is worth tuning.
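A quick way to probe that is to time queries for a few candidate sizes (hypothetical values; data as in the question):

import time
from sklearn.neighbors import NearestNeighbors

for leaf_size in (10, 30, 100):
    nbrs = NearestNeighbors(n_neighbors=10, algorithm='ball_tree',
                            leaf_size=leaf_size).fit(data)
    start = time.time()
    nbrs.kneighbors(data)
    print(leaf_size, time.time() - start)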
You might also want to check this answer of mine, which explains the trade-off between speed and accuracy, a trade-off that is critical to understand when doing nearest neighbour search.
Upvotes: 1