Reputation: 153
I am having a problem finding all overlapping ranges in two lists efficiently.
This problem is similar to This question, but with different inputs.
I have 2 input files, one that contains many lines of range and data pairs, and another that contains a list of ranges to find the intersections to.
I already wrote a file reader class that reads from the data file, returning objects, one at a time, that hold a list of range and data pairs, but am running into trouble when I try to find the overlaps of the two range lists.
Currently what I am doing is brute forcing it, comparing every range in the data list to every other range in the intersection list, but because the data file is very large, it is taking a long time.
Sample Objects:
This is the object in the data list:
public DataModel {
private int start; {set; get;}
private int end; {set; get;}
//Other Data
}
The range Model is just a list of paired integers (start, end).
while (fileParser.hasNext()) {
dataList = fileParser.next();
for (DataModel data : dataList)
for (RangeModel range : rangeList)
if(overlaps(data, range))
print(range.getString + " " + data.getString);
}
Edit for clarity:
The DataModel is given in smaller packets of similar ranges of varying length, but they are mostly under 20, so the comparison will be run repeatedly on the same RangeModel and each new DataModel. The total ranges in all the data is around 2 billion, but it doesn't really matter. Thanks for the help.
Upvotes: 1
Views: 4315
Reputation: 56809
Check whether my understanding is correct:
DataModel
s, and a small number of RangeModel
s. (My solution doesn't take advantage of this asymmetry, though)The method I'm going to describe can do range intersection between 2 list of ranges, regardless of how the ranges look like (overlapping, big range, etc.). The limit is on the sum of the sizes of the 2 list of ranges (sorting is bottleneck), and the number of ranges found (iterating through is a bottleneck).
Split the ranges into 2 EndPoint
s objects, which indicates: value (int
), start or end of range (boolean
), the starting EndPoint
object (null
in start of range; points to the EndPoint
object that represents the start of range if end of range), tag (int
, which mark whether it is a data, or the range to query).
Put all the EndPoint
s from both list of ranges together, sort them by value, tie break by putting start in front of end endpoint (if you consider touching being intersection). The complexity of sorting step is O((m + n)log(m + n)).
Loop through the sorted EndPoint
s according to this pseudo-code:
open_data = HashSet()
open_range = HashSet()
for e in endpoints:
if e is start of range:
if e is data:
print e intersect with all in open_range
open_data.add(e)
else: // e is range to test
print e intersect with all in open_data
open_range.add(e)
else: // e is end of range
if e is data:
open_data.remove(e.startPoint)
else: // e is range to test
open_range.remove(e.startPoint)
Adding and removing from the HashSet is amortized O(1). The problem is with printing intersection, which is O(k), where k is the number of intersections, and can be up to O(m * n) in worst case.
Combined, the complexity is O((m + n)log(m + n) + m * n) in worst case. You may be able to do better based on the property of the data. This is a very general solution.
Upvotes: 1
Reputation: 533492
You can sort each list of ranges o(N ln N) and perform a merge sort of those ranges O(N)
This will show up any overlapping ranges with a minimum of CPU time.
Upvotes: 0
Reputation: 1336
I can think of different optimizations, but they depend on what kind of data you want available after the check.
Sorting both the data and the ranges and processing them in order provides an instant performance improvement, since it makes no sense to test a range starting in 100 against another one ending in 50.
Another improvement would be to 'compress' the ranges. If you have ranges like (1-10), (10-20), (20-30), then you could easily replace them with a single (1-30) range, and reduce the number of tests. You can create an appropiate AggregateRange class that keeps track of the identities of its composing ranges in case you still want to know which original range is causing the overlap.
Yet another improvement would be to smartly use the previous results as you process the data list. For example: Suppose you test data range (1-10) and it happens to not overlap. Were the next test data range be (2-8), you should not need to test it against the ranges, since your previous result guarantees that it will not overlap.
The basic idea behind this improvement would be to advance the start of any untested data ranges up to and including the end of the last non-overlaping data range. If the new start surpasses its own end, then no testing is required as it does not overlap. This means non-overlaping (1-20) should transform an untested (10-100) into an untested (20-100). This may be trickier to implement, so be careful not to overdo it.
Upvotes: 1
Reputation: 3068
The ideal solution will depend on the specific characteristics of your data, but sorting your two input sets would be a good first step, which will allow you to reduce the amount of comparisons you will need.
One option would be to create an array from min(startTime) to max(endTime), and at each position, store a reference to an input value which covers this range.
Thus, if your input was
A: [1-5] and B:[3-7], you could have an data structure that looks like
1: A
2: A
3: A,B
4: A,B
5: A,B
6: B
7: B
Then to test for an intersection of [2-4] with the data set, you would simply lookup 2,3,4 in your array list and concatenate the results.
Further speed improvements could be made if you only care about IF there is an intersection, as opposed to exactly who the intersection is with. Or if you only care about AN intersection, rather than ALL intersections.
Upvotes: 0