Reputation: 33
I have an array that is about [5000000 x 6] and I need to select only the points (rows) that are at a certain a distance from each other.
The ideia should be:
Start new_array with first row from data array
Compare new_array with the second row from data array
If pdist between they are > tol, append row to new_array
Compare new_array with the third row from data array
and so on...
One problem is RAM size. I cant compare all rows at once even in pdist.
So I've been thinking in split the dataset in smaller ones, but then i dont know how to retrieve the index information for the rows in dataset.
I've tried scipy cdist, scipy euclidean, sklearn euclidean_distances, sklearn paired_distances and the below code is the fastest i could get. At first it is fast but after 40k loops it becomes really slow.
xyTotal=np.random.random([5000000,6])
tol=0.5
for i,z in enumerate(xyTotal):
if (pdist(np.vstack([np.array(ng),z]))>tol).all():
ng.append(z)
Any suggestions for this problem?
EDIT
ktree = BallTree(xyTotal, leaf_size=40,metric='euclidean')
btsem=[]
for i,j in enumerate(xyTotal):
ktree.query_radius(j.reshape(1,-1),r=tol, return_distance=True)
if (ktree.query_radius(j.reshape(1,-1),r=tol, count_only=True))==1:
btsem.append(j)
This is fast but I'm only picking outliers. When i get to points that are near to anothers (i.e. in a little cluster) I don't know hot to pick only one point and leave the others, since i will get the same metrics for all points in the cluster (they all have the same distance to each other)
Upvotes: 0
Views: 1040
Reputation: 50488
The computation is slow because the complexity of your algorithm is quadratic: O(k * n * n)
where n is len(xyTotal)
and k
is the probability of the condition being true. Thus, assuming k=0.1
and n=5000000
, the running time will be huge (likely hours of computation).
Hopefully, you can write a better implementation running in O(n * log(n))
time. However, this is tricky to implement. You need to add your ng
points in a k-d tree and then you can search the nearest neighbor and check the distance with the current point is greater than tol
.
Note that you can find Python modules implementing k-d trees and the SciPy documentation provides an example of implementation written in pure Python (so likely not very efficient).
Upvotes: 1