Reputation: 153

How to find all overlaps in two lists of ranges efficiently in Java

I am having a problem finding all overlapping ranges in two lists efficiently.
This problem is similar to This question, but with different inputs.

I have 2 input files, one that contains many lines of range and data pairs, and another that contains a list of ranges to find the intersections to.

I already wrote a file reader class that reads from the data file, returning objects, one at a time, that hold a list of range and data pairs, but am running into trouble when I try to find the overlaps of the two range lists.

Currently what I am doing is brute forcing it, comparing every range in the data list to every other range in the intersection list, but because the data file is very large, it is taking a long time.

Sample Objects:
This is the object in the data list:

public DataModel {
    private int start; {set; get;}
    private int end; {set; get;}
    //Other Data
}

The range Model is just a list of paired integers (start, end).

while (fileParser.hasNext()) {
    dataList = fileParser.next();
    for (DataModel data : dataList)
        for (RangeModel range : rangeList)
            if(overlaps(data, range))
                print(range.getString + " " + data.getString);
}

Edit for clarity:

The DataModel is given in smaller packets of similar ranges of varying length, but they are mostly under 20, so the comparison will be run repeatedly on the same RangeModel and each new DataModel. The total ranges in all the data is around 2 billion, but it doesn't really matter. Thanks for the help.

Upvotes: 1

Answers (4)

nhahtdh

Reputation: 56809

Check whether my understanding is correct:

DataModel and RangeModel represent ranges. (DataModel may contain more data, but it is irrelevant).
There are approx. 2 million DataModels, and a small number of RangeModels. (My solution doesn't take advantage of this asymmetry, though)
It is necessary to retain the ranges in DataModel as different entities, even if they are overlapping. (If you are interested in the intersection only, you can collapse the ranges when they are near to each other as an optimization).

The method I'm going to describe can do range intersection between 2 list of ranges, regardless of how the ranges look like (overlapping, big range, etc.). The limit is on the sum of the sizes of the 2 list of ranges (sorting is bottleneck), and the number of ranges found (iterating through is a bottleneck).

Split the ranges into 2 EndPoints objects, which indicates: value (int), start or end of range (boolean), the starting EndPoint object (null in start of range; points to the EndPoint object that represents the start of range if end of range), tag (int, which mark whether it is a data, or the range to query).

Put all the EndPoints from both list of ranges together, sort them by value, tie break by putting start in front of end endpoint (if you consider touching being intersection). The complexity of sorting step is O((m + n)log(m + n)).

Loop through the sorted EndPoints according to this pseudo-code:

open_data = HashSet()
open_range = HashSet()

for e in endpoints:
  if e is start of range:
    if e is data:
      print e intersect with all in open_range
      open_data.add(e)
    else: // e is range to test
      print e intersect with all in open_data
      open_range.add(e)
  else: // e is end of range
    if e is data:
      open_data.remove(e.startPoint)
    else: // e is range to test
      open_range.remove(e.startPoint)

Adding and removing from the HashSet is amortized O(1). The problem is with printing intersection, which is O(k), where k is the number of intersections, and can be up to O(m * n) in worst case.

Combined, the complexity is O((m + n)log(m + n) + m * n) in worst case. You may be able to do better based on the property of the data. This is a very general solution.

Upvotes: 1

Peter Lawrey

Reputation: 533492

You can sort each list of ranges o(N ln N) and perform a merge sort of those ranges O(N)

This will show up any overlapping ranges with a minimum of CPU time.

Upvotes: 0

Martín Valdés de León

Reputation: 1336

I can think of different optimizations, but they depend on what kind of data you want available after the check.

Sorting both the data and the ranges and processing them in order provides an instant performance improvement, since it makes no sense to test a range starting in 100 against another one ending in 50.

Another improvement would be to 'compress' the ranges. If you have ranges like (1-10), (10-20), (20-30), then you could easily replace them with a single (1-30) range, and reduce the number of tests. You can create an appropiate AggregateRange class that keeps track of the identities of its composing ranges in case you still want to know which original range is causing the overlap.

Yet another improvement would be to smartly use the previous results as you process the data list. For example: Suppose you test data range (1-10) and it happens to not overlap. Were the next test data range be (2-8), you should not need to test it against the ranges, since your previous result guarantees that it will not overlap.

The basic idea behind this improvement would be to advance the start of any untested data ranges up to and including the end of the last non-overlaping data range. If the new start surpasses its own end, then no testing is required as it does not overlap. This means non-overlaping (1-20) should transform an untested (10-100) into an untested (20-100). This may be trickier to implement, so be careful not to overdo it.

Upvotes: 1

JohnnyO

Reputation: 3068

The ideal solution will depend on the specific characteristics of your data, but sorting your two input sets would be a good first step, which will allow you to reduce the amount of comparisons you will need.

One option would be to create an array from min(startTime) to max(endTime), and at each position, store a reference to an input value which covers this range.

Thus, if your input was
A: [1-5] and B:[3-7], you could have an data structure that looks like

1: A
2: A
3: A,B
4: A,B
5: A,B
6: B
7: B

Then to test for an intersection of [2-4] with the data set, you would simply lookup 2,3,4 in your array list and concatenate the results.

Further speed improvements could be made if you only care about IF there is an intersection, as opposed to exactly who the intersection is with. Or if you only care about AN intersection, rather than ALL intersections.

Upvotes: 0

How to find all overlaps in two lists of ranges efficiently in Java

Answers (4)

Related Questions