Pairwise distance in very large datasets

Question

I have an array that is about [5000000 x 6] and I need to select only the points (rows) that are at a certain a distance from each other.

The ideia should be:

Start new_array with first row from data array

Compare new_array with the second row from data array

If pdist between they are > tol, append row to new_array

Compare new_array with the third row from data array

and so on...

One problem is RAM size. I cant compare all rows at once even in pdist.

So I've been thinking in split the dataset in smaller ones, but then i dont know how to retrieve the index information for the rows in dataset.

I've tried scipy cdist, scipy euclidean, sklearn euclidean_distances, sklearn paired_distances and the below code is the fastest i could get. At first it is fast but after 40k loops it becomes really slow.

xyTotal=np.random.random([5000000,6])
tol=0.5
for i,z in enumerate(xyTotal):
    if (pdist(np.vstack([np.array(ng),z]))>tol).all():
        ng.append(z)

Any suggestions for this problem?

EDIT

ktree = BallTree(xyTotal, leaf_size=40,metric='euclidean')
btsem=[]
for i,j in enumerate(xyTotal):
    ktree.query_radius(j.reshape(1,-1),r=tol, return_distance=True)
    if (ktree.query_radius(j.reshape(1,-1),r=tol, count_only=True))==1:
        btsem.append(j)

This is fast but I'm only picking outliers. When i get to points that are near to anothers (i.e. in a little cluster) I don't know hot to pick only one point and leave the others, since i will get the same metrics for all points in the cluster (they all have the same distance to each other)

Pairwise distance in very large datasets

Answers (1)

Related Questions