NEE2000

Reputation: 51

What is the best method for optimizing the runtime of a kd-tree in Python?

I'm currently using SciPy's scipy.spatial.KDTree to perform nearest-neighbor lookups between two large sets of earth science data. One is a collection of storm reports, each with a specific lat/lon attached; the other is land-use data on a 1 km x 1 km grid covering half of the United States.

I've performed kd-tree operations on similar datasets with roughly 4.4 * 10^7 points, and the tree built successfully in approximately 160 seconds. However, when I try to build a kd-tree from this dataset (approximately 1.6 * 10^8 points), my kernel simply times out. I'm aware that a kd-tree offers O(log n) query time, with an O(n log n) build, but I'm not too familiar with the finer workings of big-O notation, so I'm unsure whether going from 4.4 * 10^7 to 1.6 * 10^8 points should cause this kind of blow-up in runtime.
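
For reference, here is a minimal sketch of the kind of build and query I'm attempting (the array names and the random stand-in data are placeholders for my real grid and storm reports):

    import numpy as np
    from scipy.spatial import cKDTree  # C implementation; much faster than the pure-Python KDTree

    rng = np.random.default_rng(0)
    # Stand-ins for the real data: (n, 2) arrays of lat/lon pairs
    grid_points = rng.uniform([25, -125], [49, -67], size=(1_000_000, 2)).astype(np.float32)
    reports = rng.uniform([25, -125], [49, -67], size=(10_000, 2))

    tree = cKDTree(grid_points)      # this build step is what times out at ~1.6e8 points
    dist, idx = tree.query(reports)  # index of the nearest grid point for each storm report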

Is this timeout something I could avoid by partitioning the data better before building the kd-tree, or does it seem to be a fluke?

Thanks in advance!

Upvotes: 0

Views: 956

Answers (1)

denis

Reputation: 21947

Can you put your data on a regular grid, 1 km squares with high values over oceans? Then scipy.interpolate.RegularGridInterpolator is way faster than KDTree -- there's no "build tree" step at all. (The lower 48 states run from latitudes of about 25 to 49 degrees and longitudes of -125 to -67, so fewer than 17 million 1 km squares -- that should fit in memory; use np.float32.)
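
A rough sketch of that idea, with made-up grid spacing and land-use values standing in for your real data (method="nearest" does a pure grid lookup, no tree):

    import numpy as np
    from scipy.interpolate import RegularGridInterpolator

    # Made-up regular lat/lon grid over the lower 48 (coarser than 1 km to keep the demo small)
    lats = np.linspace(25.0, 49.0, 2700, dtype=np.float32)
    lons = np.linspace(-125.0, -67.0, 6400, dtype=np.float32)
    land_use = np.zeros((lats.size, lons.size), dtype=np.float32)  # your real values here; high over oceans

    # method="nearest" returns the value of the nearest grid cell -- no build step at all
    lookup = RegularGridInterpolator((lats, lons), land_use, method="nearest")

    storm_reports = np.array([[35.2, -97.4], [40.0, -105.3]])  # example lat/lon query points
    values = lookup(storm_reports)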

Also, you could try KDTree(data, leafsize=100): faster tree build, slower queries. How many query points (storm reports) are there?
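
A sketch of the leafsize variant, again with placeholder data; the only change from a default build is the leafsize argument:

    import numpy as np
    from scipy.spatial import KDTree

    rng = np.random.default_rng(0)
    grid_points = rng.uniform([25, -125], [49, -67], size=(1_000_000, 2))  # stand-in grid
    reports = rng.uniform([25, -125], [49, -67], size=(10_000, 2))         # stand-in storm reports

    # Bigger leaves mean a shallower tree: the build gets faster, each query a bit slower
    tree = KDTree(grid_points, leafsize=100)
    dist, idx = tree.query(reports)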

Upvotes: 1
