Eelco
Eelco

Reputation: 61

MemoryError in Python while using cKDTree().query_ball_tree

I have large 2D arrays with unsorted (X,Y) points, for which I need to know which points are in close proximity to each other (nearest-neighbor lookup). I have used cKDTree and query_ball_tree with succes for arrays with around 500,000 (X,Y) points. However, when I try the same algorithm for datasets of more than 1,000,000 points, query_ball_tree results in a MemoryError.

I use 64-bit Windows with 16Gb of internal memory, and have tried both 32-bit and 64-bit versions of Python and the extension modules (scipy and numpy).

def Construct_SearchTree(AllXyPoints):
    KDsearch = cKDTree(AllXyPoints)  
    return KDsearch.query_ball_tree(KDsearch, Maxdist)

My questions:

1) does anybody know of an alternative to cKDTree / query_ball_tree that consumes less memory? Speed is less important in this case that memory usage.

2) I hoped that switching from 32-bit to 64-bit python & extensions would solve the MemoryError. What could be the reason that it didn't?

Thanks for your incoming help and advice.

Upvotes: 6

Views: 2049

Answers (1)

JaminSore
JaminSore

Reputation: 3936

I experienced a MemoryError with SciPy's cKDTree during construction and scikit-learn's KDTree when calling .query_radius(). I found that Scikit-learn's BallTree was more memory efficient, and using a BallTree solved the issue for me. I've tested BallTree with 1 million data points on my 64-bit system. It still consumes all my available memory (12GB) and some swap space, but I don't get a MemoryError.

Queries on a BallTree will not be as fast as a KDTree since your data is 2D, and BallTrees are slower than KDTrees when d <= 3 (see explanation here). However, given that cKDtree and scikit-learn's KDTree both raise MemorErrors (on my system, anyway), the simplest solution is to use a BallTree.

from sklearn.neighbors import BallTree
import numpy as np

max_dist = .1
points = np.random.normal(size=2000000).reshape(1000000, 2) #1 million points
ball_tree = BallTree(points)

neighbors = ball_tree.query_radius(points, max_dist)

Depending on your Maxdist, the returned result can consume a lot of memory (up to O(n^2)), but scikit-learn's BallTree.query_radius() returns an np.array of np.arrays rather than a list of lists so it should save you some memory there (see this answer for an explanation).

Upvotes: 6

Related Questions