Reputation: 517
I have a set of 2D points, and I want to be able to make the following query with arguments x_min and n: what are the n points with largest y which have x > x_min?
To rephrase in Ruby:
class PointsThing
  def initialize(points)
    @points = points
  end

  # Negating y sorts descending, so take(n) returns the n largest-y points.
  def query(x_min, n)
    @points.select { |point| point.x > x_min }.sort_by { |point| -point.y }.take(n)
  end
end
Ideally, my class would also support an insert and delete operation.
I can't think of a data structure for this which would enable the query to run in less than O(|@points|) time. Does anyone know one?
Upvotes: 5
Views: 862
Reputation: 65506
Sort the points by x descending. For each point in order, insert it into a purely functional red-black tree ordered by y descending. Keep all of the intermediate trees in an array.
To look up a particular x_min, use binary search to find the intermediate tree where exactly the points with x > x_min have been inserted. Traverse this tree to find the first n points.
The preprocessing cost is O(p log p) in time and space, where p is the number of points. The query time is O(log p + n), where n is the number of points to be returned in the query.
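A minimal Ruby sketch of this idea, assuming points respond to x and y as in the question. The names (PNode, pinsert, first_n, PointsByX) are mine, and a plain path-copying BST stands in for the purely functional red-black tree (same interface, but only expected rather than guaranteed logarithmic depth):

PNode = Struct.new(:point, :left, :right)

# Path-copying insert, ordered by y descending; returns a new root, older versions stay valid.
def pinsert(node, point)
  return PNode.new(point, nil, nil) if node.nil?
  if point.y > node.point.y
    PNode.new(node.point, pinsert(node.left, point), node.right)
  else
    PNode.new(node.point, node.left, pinsert(node.right, point))
  end
end

# In-order traversal yields points by decreasing y; stop once n have been collected.
def first_n(node, n, out = [])
  return out if node.nil? || out.size >= n
  first_n(node.left, n, out)
  out << node.point if out.size < n
  first_n(node.right, n, out)
  out
end

class PointsByX
  def initialize(points)
    @by_x_desc = points.sort_by { |p| -p.x }                   # O(p log p)
    tree = nil
    @versions = @by_x_desc.map { |p| tree = pinsert(tree, p) } # keep all intermediate trees
  end

  def query(x_min, n)
    # Largest prefix of @by_x_desc whose points are exactly those with x > x_min.
    count = @by_x_desc.bsearch_index { |p| p.x <= x_min } || @by_x_desc.size
    return [] if count.zero?
    first_n(@versions[count - 1], n)
  end
end

For example, with Point = Struct.new(:x, :y), PointsByX.new(points).query(2.0, 3) returns up to three points with the largest y among those with x > 2.0.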
Upvotes: 2
Reputation: 17258
Notation
Let P be the set of points.
Let top_y(n, x_min) describe the query to collect the n points from P with the largest y-coordinates among those with x-coordinate greater than or equal to x_min.
Let x_0 be the minimum of the x coordinates in your point set. Partition the x axis to the right of x_0 into a set of left-hand closed, right-hand open intervals I_i by the set of x coordinates of the point set P, such that min(I_i) is the i-th smallest x coordinate from P. Define the coordinate rank r(x) of x as the index of the interval that x is an element of, or 0 if x < x_0.
Note that r(x) can be computed in O(log #({I_i})) using a binary search tree.
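Purely as an illustration (the helper name and the sorted-array representation are my own assumptions, not part of the answer), r(x) could be computed with a binary search over the distinct x coordinates of P:

# r(x): index of the interval containing x, i.e. the number of distinct
# x coordinates of P that are <= x; 0 if x < x_0.
def rank(sorted_xs, x)            # sorted_xs: distinct x coordinates, ascending
  idx = sorted_xs.bsearch_index { |v| v > x }
  idx.nil? ? sorted_xs.size : idx
end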
Simple Solution
Sort your point set by decreasing y-coordinates and save this array A, in time O(#P log #P) and space O(#P).
Process each query top_y(n, x_min) by traversing this array in order, skipping over items A_i with A_i.x < x_min and counting all other entries until the counter reaches n or you are at the end of A. This processing takes O(1) space and, in the worst case, O(#P) time, since the skipped entries are still visited.
Note that this may already be sufficient: queries top_y(n_0, a_0) with a_0 < min { p.x | p \in P } and n_0 = c * #P (c a constant) require step 1 anyway, and for n << #P and 'infrequent' queries any further optimization may not be worth the effort.
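A minimal Ruby sketch of this simple solution (the class and method names are mine; points are assumed to respond to x and y as in the question):

class SimpleTopY
  def initialize(points)
    @a = points.sort_by { |p| -p.y }   # array A: decreasing y, built once in O(#P log #P)
  end

  # top_y(n, x_min): scan A, skip entries with x < x_min, stop after n hits.
  def top_y(n, x_min)
    result = []
    @a.each do |p|
      next if p.x < x_min
      result << p
      break if result.size == n
    end
    result
  end
end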
Observation
Consider the sequences s_i, s_(i+1) of points with x-coordinates greater than or equal to min(I_i), min(I_(i+1)) respectively, each ordered by decreasing y-coordinate.
s_(i+1) is a strict subsequence of s_i.
If p_1 \in s_(i+1) and p_2.x >= p_1.x, then p_2 \in s_(i+1).
Refined Solution
A refined data structure allows for O(n) + O(log #P) query processing time.
Annotate the array A from the simple solution with a 'successor dispatch' for precisely those elements A_i with A_(i+1).x < A_i.x. This dispatch data would consist of an array disp:[r(A_(i+1).x) + 1 .. r(A_i.x)] of A-indexes of the next element in A whose x-coordinate ranks at least as high as the index into disp. The given dispatch indices suffice for processing the query, since ...
disp[j] = disp[r(A_(i+1).x) + 1] for each j <= r(A_(i+1).x);
for any x_min with r(x_min) > r(A_i.x), the algorithm would not be at this element in the first place.
The proper index to access disp is r(x_min), which remains constant throughout a query and thus takes O(log #P) to compute once per query, while the index selection itself is O(1) at each A element.
disp can be precomputed. No two disp entries across all disp arrays are identical (proof skipped, but it's easy [;-)] to see given the construction). Therefore the construction of the disp arrays can be performed stack-based in a single sweep through the point set sorted in A. As there are #P entries, the disp structure takes O(#P) space and O(#P) time to construct, being dominated by the space and time requirements for y-sorting. So in a certain sense, this structure comes for free.
Time requirements for query top_y(n, x_min):
r(x_min): O(log #P);
A: O(n).
Upvotes: 1
Reputation: 8292
Some precomputation would help in this case.
First partition the set of points, taking x_min as the pivot element.
Then, for the set of points lying to the right of x_min, build a max-heap keyed on the y coordinates.
Now run your query: perform n extract-max operations on the built max-heap.
The running time of your query would be log X + log(X-1) + ... + log(X-(n-1)), where X is the size of the original max-heap: log X for the first extract-max operation, log(X-1) for the second, and so on.
Even in the worst case, when your n << X, the time taken would be O(n log X).
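A hedged Ruby sketch of this approach; MaxHeap below is a minimal hand-rolled binary heap (Ruby's standard library has no heap class), and top_n_by_heap is my own name for the overall query:

class MaxHeap
  def initialize(items, &key)
    @key = key
    @a = items.dup
    (@a.size / 2 - 1).downto(0) { |i| sift_down(i) }  # O(X) bottom-up heap construction
  end

  def extract_max
    return nil if @a.empty?
    top = @a[0]
    last = @a.pop
    unless @a.empty?
      @a[0] = last
      sift_down(0)                                    # O(log X) per extraction
    end
    top
  end

  private

  def sift_down(i)
    loop do
      l, r = 2 * i + 1, 2 * i + 2
      largest = i
      largest = l if l < @a.size && @key.call(@a[l]) > @key.call(@a[largest])
      largest = r if r < @a.size && @key.call(@a[r]) > @key.call(@a[largest])
      break if largest == i
      @a[i], @a[largest] = @a[largest], @a[i]
      i = largest
    end
  end
end

def top_n_by_heap(points, x_min, n)
  heap = MaxHeap.new(points.select { |p| p.x > x_min }) { |p| p.y }  # partition, then heapify on y
  n.times.map { heap.extract_max }.compact                           # n extract-max operations
end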
Upvotes: 1
Reputation: 1987
If your data are not sorted, then you have no choice but to check every point, since you cannot know whether there exists another point for which y is greater than that of all other points and for which x > x_min. In short: you can't know if another point should be included if you don't check them all.
In that case, I would assume that it would be impossible to check in sublinear time as you ask for, since you have to check them all. Best case for searching all would be linear.
If your data are sorted, then your best case will be constant time (all n points are those with the greatest y), and worst case would be linear (all n points are those with least y). Average case would be closer to constant I think if your x and x_min are both roughly random within a specific range.
If you want this to scale (that is, you could have large values of n), you will want to keep your resultant set sorted as well since you will need to check new potential points against it and to drop the lowest value when you insert (if size > n). Using a tree, this can be log time.
So, to do the entire thing, the worst case is for unsorted points, in which case you're looking at n log(n) time. Sorted points are better, in which case you're looking at an average case of log(n) time (again, assuming roughly randomly distributed values for x and x_min), which yes is sub-linear.
In case it isn't at first obvious why sorted points will have constant time to search through, I will go over that here quickly.
If the n points with the greatest y values all had x > x_min (the best case), then you are just grabbing what you need off the top, so that case is obvious.
For the average case, assuming roughly randomly distributed x and x_min, the odds that x > x_min are basically half. For any two random numbers a and b, a > b is just as likely to be true as b > a. This is the same thing with x and x_min; x > x_min is equally as likely to be true as x_min > x, meaning 0.5 probability. This means that, for your points, on average every second point checked will meet your x > x_min requirement, so on average you will check 2n points to find the n highest points that meet your criteria. So the best case was c time, the average is 2c, which is still constant.
Note, however, that for values of n approaching the size of the set this hides the fact that you are going through the entire set, essentially bringing it right back up to linear time. So my assertion that it is constant time does not hold true if you assume random values of n within the range of the size of your set.
If this is not a purely academic question and is prompted by some actual need, then it depends on the situation.
(edit) I just realized that my constant-time assertions were assuming a data structure where you have direct access to the highest value and can go sequentially to lower values. If the data structure that those are provided to you in does not fit that description, then obviously that will not be the case.
Upvotes: 1