rimsoft
rimsoft

Reputation: 115

Implement Range Indexing in so that containment set is computed very efficiently in terms of time complexity

Here's the data set I have databaset in this form (56, 102, Data1) (61, 106, Data2) (45, 200, Data3) .... million rows Assume that I have memory to manage this data in JVM heap.

Give a range (60,100) It should find ranges that contain it... like (56,102) and (45,200), What is the best algo to avoid full scan. Is there any Framework out there that could index the data for me naturally to perform the contains search. Of course I can store it in a RDBMS and scoop it with a SQL. Is there pure Java solution for this ?

Upvotes: 2

Views: 58

Answers (1)

mcdowella
mcdowella

Reputation: 19621

You could try building an https://en.wikipedia.org/wiki/Interval_tree

To quote Wikipedia:

The result is a ternary tree with each node storing:

A center point
A pointer to another node containing all intervals completely to the left of the center point
A pointer to another node containing all intervals completely to the right of the center point
All intervals overlapping the center point sorted by their beginning point
All intervals overlapping the center point sorted by their ending point

(end quote)

Given a range, if you recursively search this tree from the top you can discard a lot of nodes. If the center point is outside your query range you only need to investigate the child on the same side as the query range. If the center point is inside the query range, then neither of two children contain any interval which includes the center point, so they cannot contain any interval which contains the range (since it does contain the center point) so you don't need to worry about EITHER child.

For any node that you visit, you need to go through one of the sorted lists of intervals, but you can chose which one. If the center point is smaller than most (or all) of the query range then I would chose the list of intervals sorted by end point and work through it from large to small. At the beginning of this list most or all of the intervals you will find will contain the query range, and as soon as you find that an end point is smaller than the end of the query range you can quit going through the list.

I can't guarantee an improvement, because there might be a lot of useless intervals piled up in the nodes you look at, but depending on your data you might get a speedup in practice.

Upvotes: 2

Related Questions