J S
J S

Reputation: 3496

Data Structure for Quickly Checking Numbers against Overlapping Ranges

Let's say I have the following number ranges:

0-500 0-100 75-127 125-157 130-198 198-200

Now, let's say I need to be able to check any given number and see which ranges it is in. What sort of data structure would be used most efficiently to tell that, say, the number 100 belongs to the ranges 0-500, 0-100, and 75-127? Do I just want a binary tree holding the starting values? In that case, would it be feasible for each node in the tree to hold several objects containing each range at that starting point?

Note that I only need to retrieve for this particular application, I don't really see myself needing to modify it mid-process, so retrieval speed is by far my priority.

Thanks!

Upvotes: 9

Views: 2930

Answers (4)

CMPS
CMPS

Reputation: 7769

Edited:

  • Much faster
  • Much more efficient in memory consumption
  • For 1,000 input, it's guarantee it works in less than 100 ms in worst cases
  • For 10 000 input ranges, 10x more than the required input, it still works in less than 100 ms for 50 duplicates of from ex: {0 - #}x50 , {1 - #}x50 etc... for all from ranges.

Goal

Given 10,000 input range intervals (From to - To make the scenario worst, every 'from' is duplicated 50 times). The program takes one target value and displays all the ranges where the target belongs.

Example on how it works for 4 ranges:

Given the following ranges:

0 - 100
0 - 500
50 - 500
20 - 300

Target : 40

Output:

20 - 300
0 - 500
0 - 100

Explanation of the Algorithm (using the previous example):

Every from value is mapped to an index++ starting from index = 0. So, in our example:

From => index
0 => 0
50 => 1
20 => 2

Now having an array of Tree Set called pointers, every index i of this array refers to the key of value i in the Tree Set. So pointers[0] refers to 'from': 0, pointers[1] refers to 'from': 50, pointers[2] refers to 'from': 20.

Now we add the to values to each from value respectively by looking at the tree map 'key' => 'value' where value is the index of key in the pointers array.

Moreover, we want the to values added in the pointers array at every index to be sorted in a descending order (Later I explain why).

So now the pointer becomes like this:

index => TreeSet[values..]
0 => 500 | 100
1 => 500
2 => 300

Now we are ready to get the ranges where the target belongs to.

For target = 40

1 - Search for the closest floor key of 40 in the tree map. The program finds that 20 is the closest one.

2 - It goes to the index corresponding to 20 in the pointers array. To get the index of 20: look at the tree map at key 20. Index of 20 is 2.

3 - So go now to pointers[2], it finds that there is the number 300.

4 - Now the application checks if 300 is less than target, the reason it does this, is because I mentioned before that every Tree Set created is sorted in a descending order. So if 300 is smaller than the target, therefore no need to continue checking for the next values in the pointers[2] because it's guarantee that they are smaller.

5 - In this case, 300 is greater than target, then print the key as from and pointer[2] {current element} as to.

6 - Since only one element found in pointers[2], the for loop exits, and the key is removed from the tree map, so next time when the program wants to find the next closest floor key, it will find the next one, if it exists.

7 - (Next iteration of while loop) The next key found, after removing key 20, is 0, with index 0 according to the tree map.

8 - Go to pointers[0]. It finds that the number of elements in Tree Set at pointers[0] is 2.

9 - Start with the first element pointers[0] {first element}. Is 500 less than 40 ? No, print "range". Next element, is 100 less than 40 ? No, print "range". No more elements exit for loop.

10 - Remove key 0 from tree map.

11 - Now the While loop condition checks if there exist a closest floor key for the target. No, because 0 and 20 were removed. So condition not satisfied, exit while loop.

12 - Print time consumed, and end program :)

Hope you find the explanation useful.

Code

import java.util.Collections;
import java.util.TreeMap;
import java.util.TreeSet;

public class Interval {
    public static void main(String[] args) {
        int MAX_SIZE = 10000;
        int[] from = new int[MAX_SIZE];
        int[] to = new int[MAX_SIZE];

        //Generate 10,000 (from - to), with 50 redundant (from range) for every value of from. (to make the application heavy)
        int c = 0;
        for(int i=0; i<MAX_SIZE;i++){
            from[i] = 0+c;
            to[i] = from[i] + (int)(Math.random()*100);
            if(i%50 == 0) 
                c++;
        }

        //Start time counting
        long time = System.currentTimeMillis();

        int target = 97;
        TreeMap<Integer, Integer> treePointer = new TreeMap<Integer, Integer>(); // will sotre <From, index++>

        int index = 0;
        int size = from.length;
        TreeSet<Integer>[] pointers = new TreeSet[size]; //Array of tree set to store values of every "from" range

        //insert into tree
        for(int i=0; i<from.length;i++){
            if(!treePointer.containsKey(from[i])){ //if the "from" does not exist in the tree yet, insert it
                treePointer.put(from[i], index);
                pointers[index++] = new TreeSet<Integer>(Collections.reverseOrder()); //sort descending order
            }

            //index of "from" in the pointers array
            int virtualIndex = treePointer.get(from[i]);
            //add the 'to' element to the corresponding index of "from" in Tree Set at pointers[index of "from"]
            pointers[virtualIndex].add(to[i]);
        }

        // Display part of the pointers array to understand how elements are stored
//      for(int i=0; i<10; i++){
//          for(int current : pointers[i]){
//              System.out.print(current + " ");
//          }
//          System.out.println();
//      }

        //Start checking for the ranges
        Integer currentKey = -1; //dummy number at first
        while((currentKey = treePointer.floorKey(target)) != null){ // while there is a closest floor key
          //get index assigned to the closest floor key
            int virtualIndex = treePointer.get(currentKey);
            //loop on the elements found at pointers[index of closest floor number]
            for(int toValue : pointers[virtualIndex]){
                if(toValue < target) //remember the values are sorted in a descending order, so whenever the value becomes smaller than the target don't continue the for loop
                    break;
                System.out.println(currentKey + " - " + toValue); // else target less or equal to toValue, so print range
            }
            treePointer.remove(currentKey); //remove key from tree to fetch the next floor key in the next while iteration
        }
        //Display time consumed in ms
        System.out.println("\nTotal time: " + (System.currentTimeMillis() - time) + " ms");
    }
}

Upvotes: 1

Paddy3118
Paddy3118

Reputation: 4772

Assuming you have enough memory, I would use an array of lists, where the list is of all ranges containing the index.

Here is a Python solution showing the algorithm in more detail:

# (Inclusive) ranges
ranges = [(0,500), (0,100), (75,127), (125,157), (130,198), (198,200)]
smallest = min(r[0] for r in ranges)
largest  = max(r[1] for r in ranges)

# Ceate table
table = [[] for i in range(smallest, largest+1)] # List of lists
for r in ranges: # pre-compute results
    mn, mx = r
    for index in range(mn, mx+1):
        table[index - smallest].append(r)

def check(n):
    'Return list of ranges containing n'
    if smallest <= n <= largest:
        return table[n - smallest]
    else:
        return []   # Out of range

for n in [-10, 10, 75, 127, 129, 130, 158, 197, 198, 199, 500, 501]:
    print('%3i is in groups: %r' % (n, check(n)))

The output is:

-10 is in groups: []
 10 is in groups: [(0, 500), (0, 100)]
 75 is in groups: [(0, 500), (0, 100), (75, 127)]
127 is in groups: [(0, 500), (75, 127), (125, 157)]
129 is in groups: [(0, 500), (125, 157)]
130 is in groups: [(0, 500), (125, 157), (130, 198)]
158 is in groups: [(0, 500), (130, 198)]
197 is in groups: [(0, 500), (130, 198)]
198 is in groups: [(0, 500), (130, 198), (198, 200)]
199 is in groups: [(0, 500), (198, 200)]
500 is in groups: [(0, 500)]
501 is in groups: []

Further optimisations such as using a bit-set to store what ranges are at each index instead of a list are possible.

From the comments below, I decided to modify the above to choose between the table lookup method above and a slower but vastly more memory efficient solution based on direct comparison. (Ideally there would be an interval tree based solution included too):

# (Inclusive) ranges
ranges = [(0,500), (0,100), (75,127), (125,157), (130,198), (198,200)]
limit = 1000000 # Or whatever

smallest = min(r[0] for r in ranges)
largest  = max(r[1] for r in ranges)

if (largest - smallest) * len(ranges) < limit:
    # Ceate table
    table = [[] for i in range(smallest, largest+1)] # List of lists
    for r in ranges:
        mn, mx = r
        for index in range(mn, mx+1):
            table[index - smallest].append(r)

    def check(n):
        'Return list of ranges containing n'
        if smallest <= n <= largest:
            return table[n - smallest]
        else:
            return []   # Out of range
else:
    # mpre emory efficient method, for example
    def check(n):
        return [(mn, mx) for mn, mx in ranges if mn <= n <= mx]

for n in [-10, 10, 75, 127, 129, 130, 158, 197, 198, 199, 500, 501]:
    print('%3i is in groups: %r' % (n, check(n)))

Upvotes: 1

Bartosz Marcinkowski
Bartosz Marcinkowski

Reputation: 6881

What you need is interval tree. Interval trees are quite general concept and keep a slightly different data in nodes for every problem. In your case each node would keep a list of input intervals that cover the interval node is representing.

Upvotes: 6

arunmoezhi
arunmoezhi

Reputation: 3200

Let R denote the number of ranges possible. (For your example R=6)

Create R hashtables such that each hash table can hold only one range. For your example you need to create 6 hashtables. The first hash table R1will contain values from 0-500 only.

Populate hashtable.

Each number will go into appropriate hashtables. For your example number 100 will go into R1, R2, R3. If R is large, then you would need to create lot of hashtables. But then the total space is bounded by the actual data stored in all the hashtables put together.

Retrival:

For any given number check if it is present in each of the R hashtables. You can further optimize by choosing which hashtables to look. For example for 100 you only need to look into 3 out of 6 hashtables.

Time complexity:

Searching for a single value in a hash table takes constant time (on average). So Amortized O(1) to see if a number is present in a hashtable.

Amortized O(R) to produce the output as we need to look at all the hashtables to produce the output

Upvotes: 2

Related Questions