Reputation: 4173
I'm working with a medium-size dataset (shape=(14013L, 46L)).
I want to smooth each sample with its k nearest neighbours.
I'm training my model with:

nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='ball_tree',
                        metric=sklearn.metrics.pairwise.cosine_distances).fit(data)
And the smoothing is as follows:
def smooth(x, nbrs, data, alpha):
    """
    input:
        alpha: the smoothing factor
        nbrs: trained NearestNeighbors from sklearn
        data: the original data
              (since NearestNeighbors returns only the index and not the samples)
        x: what we want to smooth
    output:
        smoothed x with its nearest neighbours
    """
    distances, indices = nbrs.kneighbors(x)
    distances = map(lambda z: abs(-z + 1), distances)[0]
    norm = sum(distances)
    if norm == 0:
        "No neighbours were found."
        return x
    distances = map(lambda z: (1 - alpha) * z / norm, distances)
    indices = map(lambda z: data[z], indices)[0]
    other = np.array([indices[i] * distances[i] for i in range(len(distances))])
    z = x * alpha
    z = z.reshape((1, z.shape[0]))
    smoothed = sum(np.concatenate((other, z), axis=0))
    return smoothed
My questions: is my implementation correct (some samples end up in the norm == 0 if condition), and why is smoothing so much slower than training?
Upvotes: 1
Views: 387
Reputation: 2487
Without your dataset and complete code, it is difficult to say for certain. Here is what I think.
distances = map(lambda z: abs(-z + 1), distances)[0]
norm = sum(distances)
Because you index the result of the map, you are working with only the first row of results. And if x is actually one of the points you trained on, then its first nearest neighbor will be... x itself. Because you are using the cosine distance, that distance is exactly 0, so abs(1 - 0) == 1 and x gets the largest weight in its own smoothing.
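You can check this directly; a quick sanity check, assuming nbrs has been fit on data:

# Query with a sample that is part of the training set
distances, indices = nbrs.kneighbors(data[0].reshape(1, -1))

# The nearest neighbor of a training point is the point itself:
# indices[0][0] == 0 and distances[0][0] == 0.0 under cosine distance,
# so the abs(1 - z) transform gives that neighbor weight 1
print(indices[0][0], distances[0][0])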
Before I suggest a correction, let's talk about performance.
With regard to performance: you are using the map function everywhere, which is a Python builtin. Scikit-learn is built on numpy, which is designed to do math much faster than native Python code; that is why training is so much faster than your code. You should be using numpy arithmetic instead of map. One example: this line
distances = map(lambda z: (1 - alpha) * z / norm, distances)
should be this
distances *= ((1-alpha)/norm)
If norm is an array, it should have the right dimensions for numpy's broadcasting rules to kick in and do "the right thing", entirely in C.
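To see the equivalence concretely (illustrative numbers, not from your data):

import numpy as np

distances = np.array([0.2, 0.5, 0.3])
alpha = 0.7
norm = distances.sum()

# map-based version: a Python-level loop
mapped = list(map(lambda z: (1 - alpha) * z / norm, distances))

# numpy version: one C-level operation over the whole array
distances *= (1 - alpha) / norm

print(np.allclose(mapped, distances))   # True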
Okay, so since I'm suggesting numpy arrays (instead of map and Python lists), I believe the correct thing for the two lines before your if statement is to drop the indexing. (The rest of your code is probably broken by the indexing as well; after the second line of your function, distances is not an array or a list, but a scalar.)
distances = np.abs(distances - 1)
norm = np.sum(distances)
Additionally, you should not call your smooth() function lots of times, once per sample. You should call it once on an N_samples by N_dimensions == 46 numpy array and adjust your smooth() code accordingly. (NearestNeighbors, for example, will return one N_samples by N_neighbors array much more quickly than N_samples individual arrays of length N_neighbors.)
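Putting those suggestions together, here is a minimal sketch of a fully vectorized version (smooth_all is a hypothetical name, and it assumes data is a plain numpy array that nbrs was fit on):

import numpy as np

def smooth_all(X, nbrs, data, alpha):
    # One batched query: both arrays have shape (n_samples, n_neighbors)
    distances, indices = nbrs.kneighbors(X)

    # Same weight transform as the original code, on the whole array at once
    weights = np.abs(1.0 - distances)

    # Per-sample normalisation; remember which rows have no weight at all
    norms = weights.sum(axis=1, keepdims=True)
    empty = norms[:, 0] == 0            # mirrors the original norm == 0 branch
    norms[empty] = 1.0                  # avoid dividing by zero
    weights = (1.0 - alpha) * weights / norms

    # data[indices] has shape (n_samples, n_neighbors, n_dims); broadcasting
    # the weights over the last axis gives each sample's weighted neighbor sum
    neighbor_part = (data[indices] * weights[:, :, None]).sum(axis=1)

    smoothed = alpha * X + neighbor_part
    smoothed[empty] = X[empty]          # rows without neighbors stay unchanged
    return smoothed

Called as smooth_all(data, nbrs, data, alpha), this does one kneighbors query and a handful of array operations instead of 14013 separate calls.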
Upvotes: 1
Reputation: 73414
First of all, why use a ball tree? Maybe your metric implies it, but if that's not the case, you could use a kd-tree too.
I will approach your question from a theoretical point of view. The radius parameter (used by radius_neighbors queries) defaults to 1.0. This might be too small for your data set, since, if I understand correctly, it specifies how far from the query point to look. So I would suggest running your code with increasing values of radius until you get some results, then decreasing it until you get no results; increase it a bit again and you have the optimal parameter for your data set.
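For example, a coarse doubling search along those lines might look like this (a hypothetical sketch; x is one query row, and the radius parameter is what radius_neighbors uses):

radius = 1.0
while True:
    distances, indices = nbrs.radius_neighbors(x.reshape(1, -1), radius=radius)
    if len(indices[0]) > 0:     # at least one neighbour found
        break
    radius *= 2.0               # widen the search and try again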
Another important parameter is leaf_size, which controls how many points are checked by brute force once a query descends to a leaf of the tree. It affects the speed and memory use of both tree construction and queries (in scikit-learn the results themselves stay exact), so it is worth tuning.
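A quick way to probe that is to time queries for a few candidate sizes (hypothetical values; data as in the question):

import time
from sklearn.neighbors import NearestNeighbors

for leaf_size in (10, 30, 100):
    nbrs = NearestNeighbors(n_neighbors=10, algorithm='ball_tree',
                            leaf_size=leaf_size).fit(data)
    start = time.time()
    nbrs.kneighbors(data)
    print(leaf_size, time.time() - start)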
You might also want to check this answer of mine, which explains the trade-off between speed and accuracy, a trade-off that is critical to understand when doing nearest neighbour search.
Upvotes: 1