Reputation: 1001
Imagine that you have a large set of m objects with properties A and B. What data structure can you use as index(es) (or which algorithm) to improve the performance of the following query?
find all objects where A between X and Y, order by B, return first N results;
That is, filter by a range on A and sort by B, but only return the first few results (say, 1000 at most). Insertions are very rare, so heavy preprocessing is acceptable. I'm not happy with the following options:
With records (or index) sorted by B: Scan the records/index in B order and return the first N whose A falls in X-Y. In the worst cases (few objects match the range X-Y, or the matches are at the end of the records/index) this becomes O(m), which for large data sets of size m is not good enough.
With records (or index) sorted by A: Do a binary search until the first object matching the range X-Y is found. Scan and build an array of references to all k objects that match the range, sort the array by B, and return the first N. That's O(log m + k + k log k). If k is small then that's effectively O(log m), but if k is large then the cost of the sort becomes even worse than the cost of the linear scan over all m objects.
Adaptive 2/1: Do a binary search for the first match of the range X-Y (using an index over A); do a binary search for the last match of the range. If the range is small, continue with algorithm 2; otherwise revert to algorithm 1. The problem is the case where we revert to algorithm 1. Although we checked that "many" objects pass the filter, which is the good case for algorithm 1, this "many" can only be a constant fraction of m at most (asymptotically, the O(m) scan always beats the O(k log k) sort once k is a large enough fraction of m). So we still have an O(m) algorithm for some queries.
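For concreteness, here is a minimal Python sketch of options 1 and 2 with the adaptive switch; the k*log2(k) < m switch threshold is only illustrative and would need tuning:

    from bisect import bisect_left, bisect_right
    from math import log2

    def query(by_a, a_keys, by_b, x, y, n):
        """by_a: objects as (a, b) pairs sorted by A; a_keys: their A values;
        by_b: the same objects sorted by B. First n by B with A in [x, y]."""
        i = bisect_left(a_keys, x)
        j = bisect_right(a_keys, y)
        k = j - i                              # objects matching the A-range
        if k <= 1 or k * log2(k) < len(by_a):
            # Algorithm 2: sort the k matches by B -- O(log m + k log k).
            return sorted(by_a[i:j], key=lambda ob: ob[1])[:n]
        # Algorithm 1: scan in B order, keep the first n that pass the
        # filter -- O(m) in the worst case.
        out = []
        for a, b in by_b:
            if x <= a <= y:
                out.append((a, b))
                if len(out) == n:
                    break
        return out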
Is there an algorithm / data structure which allows answering this query in sublinear time?
If not, what could be good compromises to achieve the necessary performance? For instance, if I don't guarantee returning the objects ranked best by their B property (recall < 1.0), then I can scan only a fraction of the B index. But could I do that while somehow bounding the quality of the results?
Upvotes: 7
Views: 2795
Reputation: 6189
The question you are asking is essentially a more general version of:
Q. You have a sorted list of words with a weight associated with each word, and you want all words which share a prefix with a given query q, and you want this list sorted by the associated weight.
Am I right?
If so, you might want to check this paper, which discusses how to do it in O(k log n) time, where k is the number of elements desired in the output set and n is the number of records in the original input set. We assume that k > log n.
http://dhruvbird.com/autocomplete.pdf
(I am the author).
Update: A further refinement I can add is that the question you are asking is related to 2-dimensional range searching, where you want everything in a given X-range plus the top-K of that set ordered by Y.
2D range search lets you find everything in an X/Y-range (when both ranges are known). In this case you only know the X-range, so you would need to run the query repeatedly, binary searching on the Y-range until you get K results. Each query can be performed in O(log n) time if you employ fractional cascading, and in O(log^2 n) with the naive approach. Either way is sublinear, so you should be okay.
Additionally, listing the k entries themselves adds an O(k) term to the running time.
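A rough Python sketch of the naive variant (no fractional cascading), with illustrative names not taken from the paper: a merge sort tree answers each counting query in O(log^2 n), and a binary search over B values finds the smallest cutoff that yields K hits.

    import bisect

    class MergeSortTree:
        """Segment tree over (A, B) points sorted by A; each node keeps the
        (B, index) pairs of its range sorted by B."""

        def __init__(self, points):
            self.pts = sorted(points)                # sort by A
            self.a_keys = [a for a, _ in self.pts]
            self.b_keys = sorted(b for _, b in self.pts)
            self.n = len(self.pts)
            self.node = [[] for _ in range(4 * self.n)]
            if self.n:
                self._build(1, 0, self.n - 1)

        def _build(self, v, lo, hi):
            if lo == hi:
                self.node[v] = [(self.pts[lo][1], lo)]
                return
            mid = (lo + hi) // 2
            self._build(2 * v, lo, mid)
            self._build(2 * v + 1, mid + 1, hi)
            self.node[v] = sorted(self.node[2 * v] + self.node[2 * v + 1])

        def _visit(self, v, lo, hi, i, j, b_max, out):
            """Count points with index in [i, j] and B <= b_max; if out is
            not None, also collect their (B, index) pairs."""
            if j < lo or hi < i:
                return 0
            if i <= lo and hi <= j:
                pos = bisect.bisect_right(self.node[v], (b_max, self.n))
                if out is not None:
                    out.extend(self.node[v][:pos])
                return pos
            mid = (lo + hi) // 2
            return (self._visit(2 * v, lo, mid, i, j, b_max, out) +
                    self._visit(2 * v + 1, mid + 1, hi, i, j, b_max, out))

        def top_k(self, x, y, k):
            """First k points with A in [x, y], in increasing B order."""
            i = bisect.bisect_left(self.a_keys, x)
            j = bisect.bisect_right(self.a_keys, y) - 1
            if i > j:
                return []
            lo, hi = 0, len(self.b_keys) - 1   # binary search on the B cutoff
            while lo < hi:
                mid = (lo + hi) // 2
                if self._visit(1, 0, self.n - 1, i, j,
                               self.b_keys[mid], None) >= k:
                    hi = mid
                else:
                    lo = mid + 1
            hits = []
            self._visit(1, 0, self.n - 1, i, j, self.b_keys[lo], hits)
            hits.sort()
            return [self.pts[idx] for _, idx in hits[:k]]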
Upvotes: 3
Reputation: 6245
The outcome you describe is what most search engines are built to achieve (sorting, filtering, paging). If you haven't done so already, check out a search engine like Norch or Solr.
Upvotes: 1
Reputation: 1627
This is not really a fully fleshed-out solution, just an idea. How about building a quadtree on the A and B axes? You would walk down the tree in, say, a breadth-first manner; at each node, if its A-range lies entirely within [X, Y], add it to a set S and stop recursing; if its A-range is disjoint from [X, Y], discard it; otherwise recurse into its children.
Now you have the set S of all maximal subtrees with A-coordinates between X and Y; there are at most O(sqrt(m)) of these subtrees, as I will show below.
Some of these subtrees will contain O(m) entries (certainly they will contain O(m) entries all added together), so we can't do anything that touches all entries of all subtrees. We can instead make a heap of the subtrees in S, such that the B-minimum of each subtree is less than the B-minima of its children in the heap. Now extract B-minimal elements from the top node of the heap until you have N of them; whenever you extract an element from a subtree with k elements, you need to decompose that subtree into O(log(k)) subtrees not containing the recently extracted element.
Now let's consider complexity. Finding the O(sqrt(m)) subtrees takes at most O(sqrt(m)) steps (exercise for the reader, using arguments from the proof below). We should probably insert them into the heap as we find them; this takes O(sqrt(m) * log(sqrt(m))) = O(sqrt(m) * log(m)) steps. Extracting a single element from a k-element subtree in the heap takes O(sqrt(k)) time to find the element, and then inserting the O(log(sqrt(k))) = O(log(k)) resulting subtrees back into the heap of size O(sqrt(m)) takes O(log(k) * log(sqrt(m))) = O(log(k) * log(m)) steps. We could probably be smarter using potentials, but we can at least bound k by m, so that leaves N * O(sqrt(k) + log(k) * log(m)) = O(N * (sqrt(m) + log(m)^2)) = O(N * sqrt(m)) steps for the extraction, and O(sqrt(m) * (N + log(m))) steps in total... which is sublinear in m.
Here's a proof of the bound of O(sqrt(m)) subtrees. There are several strategies for building a quadtree, but for ease of analysis let's say that we make a binary tree: at the root node we split the data set according to A-coordinate around the point with median A-coordinate; one level down we split the data set according to B-coordinate around the point with median B-coordinate (that is, the median for the half of the points contained in that half-tree); and we continue alternating the splitting direction per level.
The height of the tree is log(m). Now let's consider how many subtrees we need to recurse into. We only need to recurse into a subtree if it contains the A-coordinate X, or the A-coordinate Y, or both. At the (2*k)-th level down there are 2^(2*k) subtrees in total. By then, each subtree has had its A-range subdivided k times already, and every time we do that, only half of the trees contain the A-coordinate X. So at most 2^k subtrees contain the A-coordinate X. Similarly, at most 2^k will contain the A-coordinate Y. This means that in total we will recurse into at most 2 * sum(2^k, k = 0 .. log(m)/2) = 2 * (2^(log(m)/2 + 1) - 1) = O(sqrt(m)) subtrees.
Since we examine at most 2^k subtrees at the (2*k)-th level down, we can also add at most 2^k subtrees at that level to S. This gives the final result.
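To make this concrete, here is a rough Python sketch under the alternating median-split scheme described above. Instead of explicitly decomposing a subtree after each extraction, it performs the equivalent best-first search with a heap keyed by each subtree's precomputed B-minimum; all names are illustrative.

    import heapq

    class Node:
        __slots__ = ("point", "left", "right", "min_b")

        def __init__(self, point, left, right):
            self.point, self.left, self.right = point, left, right
            self.min_b = min([point[1]] +
                             [c.min_b for c in (left, right) if c is not None])

    def build(points, axis=0):
        """Binary 'quadtree': split around the median A at even levels,
        around the median B at odd levels."""
        if not points:
            return None
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return Node(points[mid],
                    build(points[:mid], 1 - axis),
                    build(points[mid + 1:], 1 - axis))

    def collect(node, x, y, a_lo, a_hi, out, axis=0):
        """Collect the maximal subtrees whose A-range lies inside [x, y];
        split points on the boundary become singleton nodes."""
        if node is None or a_hi < x or y < a_lo:
            return
        if x <= a_lo and a_hi <= y:
            out.append(node)
            return
        a = node.point[0]
        if x <= a <= y:
            out.append(Node(node.point, None, None))
        if axis == 0:   # the A-range narrows only on A-split levels
            collect(node.left, x, y, a_lo, a, out, 1)
            collect(node.right, x, y, a, a_hi, out, 1)
        else:
            collect(node.left, x, y, a_lo, a_hi, out, 0)
            collect(node.right, x, y, a_lo, a_hi, out, 0)

    def first_n_by_b(root, x, y, n):
        """First n points with A in [x, y], in increasing B order."""
        roots = []
        collect(root, x, y, float("-inf"), float("inf"), roots)
        heap = [(t.min_b, id(t), t) for t in roots]   # id() only breaks ties
        heapq.heapify(heap)
        out = []
        while heap and len(out) < n:
            _, _, t = heapq.heappop(heap)
            kids = [c for c in (t.left, t.right) if c is not None]
            if t.point[1] == t.min_b:
                out.append(t.point)   # the subtree minimum is this very point
            else:
                kids.append(Node(t.point, None, None))  # re-queue the point
            for c in kids:
                heapq.heappush(heap, (c.min_b, id(c), c))
        return out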
Upvotes: 2
Reputation: 21
Set up a segment tree on A and, for each segment, precompute the top N in range. To query, break the input range into O(log m) segments and merge the precomputed results. Query time is O(N log log m + log m); space is O(m log N).
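A small Python sketch of this idea (representation and names are mine): each node of a segment tree over the A-sorted items keeps its range's items sorted by B and truncated to N, and a query lazily merges the O(log m) decomposed lists. It assumes queries never ask for more than the precomputed N.

    import heapq
    from bisect import bisect_left, bisect_right
    from itertools import islice

    class TopNSegmentTree:
        """Segment tree over items sorted by A; each node stores at most
        n_max (b, a) pairs: the best-by-B items of its range, pre-sorted."""

        def __init__(self, items, n_max):
            self.items = sorted(items)              # sort (a, b) pairs by A
            self.a_keys = [a for a, _ in self.items]
            self.n_max = n_max
            self.m = len(self.items)
            self.top = [[] for _ in range(4 * self.m)]
            if self.m:
                self._build(1, 0, self.m - 1)

        def _build(self, v, lo, hi):
            if lo == hi:
                a, b = self.items[lo]
                self.top[v] = [(b, a)]
                return
            mid = (lo + hi) // 2
            self._build(2 * v, lo, mid)
            self._build(2 * v + 1, mid + 1, hi)
            merged = heapq.merge(self.top[2 * v], self.top[2 * v + 1])
            self.top[v] = list(islice(merged, self.n_max))  # keep top N only

        def _segments(self, v, lo, hi, i, j, out):
            """Decompose [i, j] into O(log m) canonical segments' top lists."""
            if j < lo or hi < i:
                return
            if i <= lo and hi <= j:
                out.append(self.top[v])
                return
            mid = (lo + hi) // 2
            self._segments(2 * v, lo, mid, i, j, out)
            self._segments(2 * v + 1, mid + 1, hi, i, j, out)

        def query(self, x, y, n):
            """First n items with A in [x, y], in increasing B order."""
            i = bisect_left(self.a_keys, x)
            j = bisect_right(self.a_keys, y) - 1
            if i > j:
                return []
            lists = []
            self._segments(1, 0, self.m - 1, i, j, lists)
            # lazy n-way merge of O(log m) sorted lists, stopping after n
            return [(a, b) for b, a in islice(heapq.merge(*lists), n)]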
Upvotes: 2
Reputation: 134125
If the number of items you want to return is small (up to about 1% of the total number of items), then a simple heap selection algorithm works well. See When theory meets practice. But it's not sublinear.
For expected sublinear performance, you can sort the items by A. When queried, use binary search to find the first item where A >= X, and then sequentially scan items until A > Y, using the heap selection technique I outlined in that blog post. This should give you O(log m) for the initial search and then O(k log N) for the scan, where k is the number of items with X <= A <= Y and N is the number of items you want returned. Yes, it will still be O(m log N) for some queries. The deciding factor is the size of k.
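A short Python rendition of that heap selection scan (my sketch, not the blog's code): it keeps the N best-by-B items seen so far in a bounded max-heap while scanning the A-range.

    import heapq
    from bisect import bisect_left

    def first_n(items, a_keys, x, y, n):
        """items: (a, b) pairs sorted by A; a_keys: their A values.
        Returns the n smallest-B items with A in [x, y], in B order."""
        heap = []                                  # max-heap of size <= n (B negated)
        i = bisect_left(a_keys, x)                 # O(log m) initial search
        while i < len(items) and items[i][0] <= y:  # scan the k matches
            a, b = items[i]
            if len(heap) < n:
                heapq.heappush(heap, (-b, a))
            elif b < -heap[0][0]:                  # beats the worst kept so far
                heapq.heapreplace(heap, (-b, a))   # O(log n) per improvement
            i += 1
        return sorted(((a, -nb) for nb, a in heap), key=lambda ab: ab[1])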
Upvotes: 2
Reputation: 178521
Assuming N << k < m, this can be done in O(log m + k + N log N), similar to what you suggested in option 2, but it saves some time: you don't need to sort all k elements, only N of them, which is much smaller!
The database is sorted by A.
(1) Find the first and the last elements matching the range, and create a list containing the k elements between them.
(2) Find the N-th element in B order using a selection algorithm (*), and create a new list of size N; with a second iteration, populate it with the N best elements.
(3) Sort this last list by B.
(*) Selection algorithm: finds the N-th element in the desired B order. It is O(n) in general, or O(k) here, because the list's size is k.
Complexity:
Step 1 is trivially O(log m + k).
Step 2 is O(k) for the selection, plus another O(k) for the second iteration, since the list has only k elements.
Step 3 is O(N log N): a simple sort of a list that contains only N elements.
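A Python sketch of the three steps, taking "first N" to mean the N smallest B values (flip the comparisons for the other direction); the quickselect is a generic expected-O(k) selection, not a specific library routine.

    import random
    from bisect import bisect_left, bisect_right

    def select_n(arr, n, key):
        """Step 2's selection: partition arr in place so arr[:n] holds the
        n best elements by key, in expected O(len(arr)) time."""
        lo, hi = 0, len(arr) - 1
        while lo < hi:
            pivot = key(arr[random.randint(lo, hi)])
            i, j = lo, hi
            while i <= j:                      # Hoare-style partition
                while key(arr[i]) < pivot: i += 1
                while key(arr[j]) > pivot: j -= 1
                if i <= j:
                    arr[i], arr[j] = arr[j], arr[i]
                    i, j = i + 1, j - 1
            if n - 1 <= j:
                hi = j
            elif n - 1 >= i:
                lo = i
            else:
                break

    def first_n(items, a_keys, x, y, n):
        """items: (a, b) pairs sorted by A; a_keys: their A values."""
        i = bisect_left(a_keys, x)             # step 1: O(log m) to find the
        j = bisect_right(a_keys, y)            # range; the k matches follow
        matches = items[i:j]
        if len(matches) > n:
            select_n(matches, n, key=lambda ab: ab[1])   # step 2: O(k) expected
            del matches[n:]
        matches.sort(key=lambda ab: ab[1])     # step 3: O(N log N)
        return matches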
Upvotes: 2