Reputation: 3
I have a pandas dataframe with 20K rows and 50 columns. I want to find the 5 nearest neighbors of each row within this dataframe, based on the Euclidean distance over the columns. So the result is a 20K * 5 matrix whose columns are the ids of the nearest neighbors in the dataframe.
I'm looking for a solution that does this as efficiently as possible, preferably using the indices provided by pandas, parallel operations, or vectorized operations. SciPy's KD-tree was quite slow.
Any ideas?
Upvotes: 0
Views: 2733
Reputation: 840
It does indeed seem like SciPy's KD-tree is slow for your case; it took about 80 ms to query a single point, which I'm guessing works out to around 0.08 s * 20_000 = 1,600 s of total compute time for your full data set.
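For reference, here's a minimal sketch of what a batched KD-tree query might look like, assuming scipy.spatial.cKDTree (querying all rows in one call instead of looping over single points in Python can already help):
import numpy as np
from scipy.spatial import cKDTree

arr = np.random.rand(20_000, 50) * 20

tree = cKDTree(arr)
# Ask for 6 neighbors, since each row's nearest neighbor is itself,
# and drop that self-match afterwards.
dist, idx = tree.query(arr, k=6)
neighbors = idx[:, 1:]  # 20_000 x 5 matrix of neighbor row indices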
Another option for higher-dimensional data (such as a dataset with 50 columns) might be the Ball Tree data structure. As the page in the link says:
Because of the spherical geometry of the ball tree nodes, it can out-perform a KD-tree in high dimensions, though the actual performance is highly dependent on the structure of the training data.
Playing around with the following code gave these timings:
from sklearn.neighbors import NearestNeighbors
import numpy as np

# 20k random points in 50 dimensions, mirroring the question's shape
arr = np.random.rand(20_000, 50) * 20
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(arr)
%timeit nbrs.kneighbors(arr[:10, :])
# 24.6 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit nbrs.kneighbors(arr[:100, :])
# 209 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit nbrs.kneighbors(arr[:1000, :])
# 2.02 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Looking at these %timeit results, it seems like the query time scales roughly linearly with the number of query points, so for 20k rows you can probably expect about 20_000 / 1_000 * 2 s = ~40 s. That's a lot faster than the ~1,600 s you can most likely expect from the KD-tree data structure.
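To get the 20K * 5 matrix of neighbor ids you're after, you can query all rows in one call and drop the self-match in the first column. A sketch along those lines (n_jobs=-1 is a NearestNeighbors parameter that spreads the queries over all cores, which matches your wish for parallel operations):
from sklearn.neighbors import NearestNeighbors
import numpy as np

arr = np.random.rand(20_000, 50) * 20

# Ask for 6 neighbors because each row's nearest neighbor is itself,
# then drop that first column; n_jobs=-1 parallelizes the queries.
nbrs = NearestNeighbors(n_neighbors=6, algorithm='ball_tree', n_jobs=-1).fit(arr)
distances, indices = nbrs.kneighbors(arr)
neighbor_ids = indices[:, 1:]  # shape (20_000, 5): ids of the 5 nearest rows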
Finally, I'd definitely suggest giving that nearest neighbors page a thorough read so you fully understand all of the intricacies of the algorithms it offers.
Upvotes: 2