Alan De Maman
Alan De Maman

Reputation: 33

Pairwise distance in very large datasets

I have an array that is about [5000000 x 6] and I need to select only the points (rows) that are at a certain a distance from each other.

The ideia should be:

Start new_array with first row from data array

Compare new_array with the second row from data array

If pdist between they are > tol, append row to new_array

Compare new_array with the third row from data array

and so on...

One problem is RAM size. I cant compare all rows at once even in pdist.

So I've been thinking in split the dataset in smaller ones, but then i dont know how to retrieve the index information for the rows in dataset.

I've tried scipy cdist, scipy euclidean, sklearn euclidean_distances, sklearn paired_distances and the below code is the fastest i could get. At first it is fast but after 40k loops it becomes really slow.

xyTotal=np.random.random([5000000,6])
tol=0.5
for i,z in enumerate(xyTotal):
    if (pdist(np.vstack([np.array(ng),z]))>tol).all():
        ng.append(z)

Any suggestions for this problem?

EDIT

ktree = BallTree(xyTotal, leaf_size=40,metric='euclidean')
btsem=[]
for i,j in enumerate(xyTotal):
    ktree.query_radius(j.reshape(1,-1),r=tol, return_distance=True)
    if (ktree.query_radius(j.reshape(1,-1),r=tol, count_only=True))==1:
        btsem.append(j)

This is fast but I'm only picking outliers. When i get to points that are near to anothers (i.e. in a little cluster) I don't know hot to pick only one point and leave the others, since i will get the same metrics for all points in the cluster (they all have the same distance to each other)

Upvotes: 0

Views: 1040

Answers (1)

Jérôme Richard
Jérôme Richard

Reputation: 50488

The computation is slow because the complexity of your algorithm is quadratic: O(k * n * n) where n is len(xyTotal) and k is the probability of the condition being true. Thus, assuming k=0.1 and n=5000000, the running time will be huge (likely hours of computation).

Hopefully, you can write a better implementation running in O(n * log(n)) time. However, this is tricky to implement. You need to add your ng points in a k-d tree and then you can search the nearest neighbor and check the distance with the current point is greater than tol.

Note that you can find Python modules implementing k-d trees and the SciPy documentation provides an example of implementation written in pure Python (so likely not very efficient).

Upvotes: 1

Related Questions