Reputation: 189
I am using a kd-tree algorithm (found here) to find the nearest neighbour of a point in matrix 1 with respect to all points in matrix 2. The algorithm linked above is pretty quick, being able to find 3E6 nearest neighbours in about 20 seconds using
nn2(SetA,SetB,k=1)
Now, I only want to include nearest neighbours within a certain radius of each other so I tried
nn2(SetA,SetB,k=1,searchtype='radius',radius=1000)
Which works well, but slows down the computation incredibly, really by orders of magnitude (factor 1000 or more). I don't understand why this would happen, because the way I see it the maximum radius should in fact reduce the computation time because not the entire space has to be scanned.
Can someone explain either what is going wrong? Or why this is expected behaviour?
Example code that reproduces the behavior
library(data.table)
library(RANN)
N=50000
DT1=data.table(x=sample(0:300,N,replace=T),y=sample(301:600,N,replace=T))
DT2=data.table(x=sample(0:300,N,replace=T),y=sample(301:600,N,replace=T))
ptm=proc.time()
nnlistV1=nn2(DT1,DT2,k=1)
proc.time()-ptm
ptm=proc.time()
nnlistV2=nn2(DT1,DT2,k=1,searchtype="radius",radius=20)
proc.time()-ptm
Upvotes: 3
Views: 302
Reputation: 2907
From the ANNmanual,
Because it visits all the points in the search radius one by one, the search procedure is rather inefficient if the number of points in the radius bound is large.
I didn't look at all the code, but the annkSearch is the standard code that your first nn2 eventually calls, and the annkFRSearch (FR = Fixed Radius) is the slow code referred to above. The radius version isn't a constraint on the standard, it's completely different.
Running this shows that "the number of points in the radius bound is large" when trying to find k=1.
nrow(DT1[DT1$x >= 130 & DT1$x <= 170, ])
[1] 6711
Running a problem more suited to the fixed radius search with a more dispersed range to sample from and a k that is bigger than 1 but slightly smaller than the points likely found in the range:
N=50000
DT1=data.table(x=sample(0:30000,N,replace=T),y=sample(301:60000,N,replace=T))
DT2=data.table(x=sample(0:30000,N,replace=T),y=sample(301:60000,N,replace=T))
nrow(DT1[DT1$x >= 130 & DT1$x <= 170, ])
# I got 70 on my random sample
ptm=proc.time()
nnlistV1=nn2(DT1,DT2,k=50)
proc.time()-ptm
ptm=proc.time()
nnlistV2=nn2(DT1,DT2,k=50,searchtype="radius",radius=400)
proc.time()-ptm
Your code gave me times of .11 and 1.81. With the changes above, I get times of .69 and .28 for standard and radius respectively.
Upvotes: 1