Reputation: 6426
I've been using spatial.cKDTree
in scipy
to calculate distances between points. It has always run very quickly (~1 s) for my typical data sets (finding distances for ~1000 points to an array of ~1e6 points).
I'm running this code in python 2.7.6 on a computer with Ubuntu 14.10. Up until this morning, I had managed most python packages with apt-get
, including scipy
and numpy
. I wanted up-to-date versions of a few packages though, so I decided to packages installed in /usr/lib/python2.7/
by apt-get
, and re-installed all packages with pip install
(taking care of scipy
dependencies like liblapack-dev
with apt-get
, as necessary). Everything installed and is importable without a problem.
import scipy
import cython
scipy.__version__
'0.16.0'
cython.__version__
'0.22.1'
Now, running spatial.cKDTree
on the same size data sets is going really slowly. I'm seeing run time of ~500 s rather than ~1 s. I'm having trouble figuring out what is going on.
Any suggestions as to what I might have done in installing using pip
rather than apt-get
that would have caused scipy.spatial.cKDTree
to run so slowly?
Upvotes: 3
Views: 3508
Reputation: 1144
In 0.16.x
I added options to build the cKDTree
with median or sliding midpoint rules, as well as choosing whether to recompute the bounding hyperrectangle at each node in the kd-tree. The defaults are based on experiences about the performance of scipy.spatial.cKDTree
and sklearn.neighbors.KDTree
. In some contrived cases (data that are highly streched along a dimension) it can have negative impact, but usually it should be faster. Experiment with bulding the cKDTree
with balanced_tree=False
and/or compact_nodes=False
. Setting both to False
gives you the same behavior as 0.15.x
. Unfortunately it is difficult to set defaults that make everyone happy because the performance depends on the data.
Also note that with balanced_tree=True
we compute medians by quickselect when the kd-tree is constructed. If the data for some reason is pre-sorted, it will be very slow. In this case it will help to shuffle the rows of the input data. Or you can set balanced_tree=False
to avoid the partial quicksorts.
There is also a new option to multithread the nearest-neighbor query. Try to call query
with n_jobs=-1
and see if it helps for you.
Update June 2020: SciPy 1.5.0 will use a new algorithm (introselect based partial sort, from C++ STL) which solves the problems reported here.
Upvotes: 16
Reputation: 1144
In the next release of SciPy, balanced kd-trees will be created with introselect instead of quickselect, which is much faster on structured datasets. If you use cKDTree on a structured data set such as an image or a grid, you can look forward to a major boost in performance. It is already available if you build SciPy from its master branch on GitHub.
Upvotes: 1