Reputation: 21
I have a large amount of data (over 10^5 points). I am searching a fast algorithm which finds all points in the dataset, which lie in a circle given by its center point and radius.
I thought about using an kd-tree to calculate for example the 10 nearest points to the circle's center, and then check if they are inside the circle. But I am not sure if this is the correct way.
Upvotes: 1
Views: 1545
Reputation: 51034
A space-partitioning data structure like a k-d tree or quadtree could answer your query exactly; no need for a heuristic like "10 nearest points".
You can recursively search the tree:
For step 2, the rectangle is fully contained in the circle if and only if its four corners are. For steps 1 and 3, you can either test whether the rectangle intersects the circle's bounding box, and conservatively recurse because the rectangle might intersect the circle; or you can do an exact (but slightly more complicated) check between the rectangle and the actual circle, saving a few unnecessary recursive calls.
Since this is Python, instead of returning the points in a collection and then concatenating/unioning the results of recursive calls, you could yield the points from a recursive generator function. That simplifies the implementation somewhat.
Upvotes: 2
Reputation: 6482
If you are only interested in the number of points which are in the circle you can try Numba.
import numpy as np
import numba as nb
import numexpr as ne
def method2(X,Y,cx,cy,r):
"""Numexpr method"""
res = ne.evaluate('((X-cx)**2 + (Y-cy)**2) < r**2')
return res
@nb.njit(fastmath=True,parallel=True)
def method3(X,Y,cx,cy,r):
acc=0
for i in nb.prange(X.shape[0]):
if ((X[i]-cx)**2 + (Y[i]-cy)**2) < r**2:
acc+=1
return acc
Timings
# Ensure repeatable, deterministic randomness!
np.random.seed(42)
# Generate test arrays
N = 1000000
X = np.random.rand(N)
Y = np.random.rand(N)
# Define centre and radius
cx = cy = r = 0.5
#@Mark Setchell
%timeit method2(X,Y,cx,cy,r)
#825 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit method3(X,Y,cx,cy,r)
#480 µs ± 94.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 1
Reputation: 207425
I benchmarked a numexpr
version against a simple Numpy implementation as follows:
#!/usr/bin/env python3
import numpy as np
import numexpr as ne
# Ensure repeatable, deterministic randomness!
np.random.seed(42)
# Generate test arrays
N = 1000000
X = np.random.rand(N)
Y = np.random.rand(N)
# Define centre and radius
cx = cy = r = 0.5
def method1(X,Y,cx,cy,r):
"""Straight Numpy determination of points in circle"""
d = (X-cx)**2 + (Y-cy)**2
res = d < r**2
return res
def method2(X,Y,cx,cy,r):
"""Numexpr determination of points in circle"""
res = ne.evaluate('((X-cx)**2 + (Y-cy)**2)<r**2')
return res
def method3(data,a,b,r):
"""List based determination of points in circle, with pre-filtering using a square"""
in_square_points = [(x,y) for (x,y) in data if a-r < x < a+r and b-r < y < b+r]
in_circle_points = [(x,y) for (x,y) in in_square_points if (x-a)**2 + (y-b)**2 < r**2]
return in_circle_points
# Timing
%timeit method1(X,Y,cx,cy,r)
%timeit method2(X,Y,cx,cy,r)
# Massage input data (before timing) to match agorithm
data=[(x,y) for x,y in zip(X,Y)]
%timeit method3(data,cx,cy,r)
I then timed it in IPython as follows:
%timeit method1(X,Y,cx,cy,r)
6.68 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit method2(X,Y,cx,cy,r)
743 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit method3(data,cx,cy,r)
1.11 s ± 9.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So the numexpr
version came out 9x faster. As the points lie in the range [0..1], the algorithm is effectively calculating pi
and the two methods come out the same:
method1(X,Y,cx,cy,r).sum()
784973
method2(X,Y,cx,cy,r).sum()
784973
len(method3(data,cx,cy,r))
784973
4 * 784973 / N
3.139
Note: I should point out that numexpr
multi-threads your code across multiple CPU cores for you, automatically. If you feel like experimenting, with the number of threads, you can change it dynamically before calling method2()
, or even inside there, with:
# Split calculations across 6 threads
ne.set_num_threads(6)
Anyone else wishing to test the speed of their method is welcome to use my code as a benchmarking framework.
Upvotes: 4
Reputation: 2113
If you want first to filter a large amount of your dataset without huge computations, you can use the Square of size (2r x 2r) with the same center as the circle (where r is the circle's radius).
If you have the center's coordinates (a,b) and r the radius, then the points (x,y) inside the square verify :
in_square_points = [(x,y) for (x,y) in data if a-r < x < a+r and b-r < y < b+r]
And finally after this filter you can apply the circle equation :
in_circle_points = [(x,y) for (x,y) in in_square_points if (x-a)**2 + (y-b)**2 < r**2]
** EDIT **
if your input is structured like this :
data = [
[13, 45],
[-1, 2],
...
[60, -4]
]
Then you can try, if you prefer common for-loops :
in_square_points = []
for i in range(len(data)):
x = data[i][0]
y = data[i][1]
if a-r < x < a+r and b-r < y < b+r:
in_square_points.append([x, y])
print(in_square_points)
Upvotes: 1
Reputation: 864
To check whether a point (a, b)
is within a circle of center (x, y)
and radius r
, then you can simply do a computation:
within_circle = ((x-a)**2 + (y-b)**2) <= r*r)
This equation uses the property of the circle on which can get the absolute distance to a point (which is also used in the distance formula if you noticed).
Upvotes: 1