scipy.spatial.ckdtree running slowly

Question

I've been using spatial.cKDTree in scipy to calculate distances between points. It has always run very quickly (~1 s) for my typical data sets (finding distances for ~1000 points to an array of ~1e6 points).

I'm running this code in python 2.7.6 on a computer with Ubuntu 14.10. Up until this morning, I had managed most python packages with apt-get, including scipy and numpy. I wanted up-to-date versions of a few packages though, so I decided to packages installed in /usr/lib/python2.7/ by apt-get, and re-installed all packages with pip install (taking care of scipy dependencies like liblapack-dev with apt-get, as necessary). Everything installed and is importable without a problem.

import scipy
import cython
scipy.__version__
'0.16.0'
cython.__version__
'0.22.1'

Now, running spatial.cKDTree on the same size data sets is going really slowly. I'm seeing run time of ~500 s rather than ~1 s. I'm having trouble figuring out what is going on.

Any suggestions as to what I might have done in installing using pip rather than apt-get that would have caused scipy.spatial.cKDTree to run so slowly?

Sturla Molden · Accepted Answer

In 0.16.x I added options to build the cKDTree with median or sliding midpoint rules, as well as choosing whether to recompute the bounding hyperrectangle at each node in the kd-tree. The defaults are based on experiences about the performance of scipy.spatial.cKDTree and sklearn.neighbors.KDTree. In some contrived cases (data that are highly streched along a dimension) it can have negative impact, but usually it should be faster. Experiment with bulding the cKDTree with balanced_tree=False and/or compact_nodes=False. Setting both to False gives you the same behavior as 0.15.x. Unfortunately it is difficult to set defaults that make everyone happy because the performance depends on the data.

Also note that with balanced_tree=True we compute medians by quickselect when the kd-tree is constructed. If the data for some reason is pre-sorted, it will be very slow. In this case it will help to shuffle the rows of the input data. Or you can set balanced_tree=False to avoid the partial quicksorts.

There is also a new option to multithread the nearest-neighbor query. Try to call query with n_jobs=-1 and see if it helps for you.

Update June 2020: SciPy 1.5.0 will use a new algorithm (introselect based partial sort, from C++ STL) which solves the problems reported here.

scipy.spatial.ckdtree running slowly

Answers (2)

Related Questions