Consider a list of intervals of the form (start, end) . The intervals are already sorted in the list by their start component. My question is if there's a way to calculate for each different interval, how many intervals will overlap with it, in O(n) time. I envision several variants working in O(n lg n) time, but I was curious regarding the O(n) constraint. For instance, an O(n lg n) solution would be: Add all the start and end values to an array A and sort the array (so we're now in the O(n lg n) realm*), eliminating any duplicates in the process. From that array A create an array R of regions (N-1 regions with R[i] = (A[i], A[i+1]) ). Now it's just a matter of iterating over the intervals array and increment the value of all its associated regions. This is done in O(n) . *Well, if we knew the intervals were densely packed in a small region we could use a counting sort and that would bring us down back to O(n) but this doesn't seem like a good assumption for the general case. Any way to improve this to O(n) ?

Reputation: 105037

Calculating number of overlapping intervals in O(n) time?

Consider a list of intervals of the form (start, end). The intervals are already sorted in the list by their start component.

My question is if there's a way to calculate for each different interval, how many intervals will overlap with it, in O(n) time.

I envision several variants working in O(n lg n) time, but I was curious regarding the O(n) constraint.

For instance, an O(n lg n) solution would be:

Add all the start and end values to an array A and sort the array (so we're now in the O(n lg n) realm*), eliminating any duplicates in the process.
From that array A create an array R of regions (N-1 regions with R[i] = (A[i], A[i+1])).
Now it's just a matter of iterating over the intervals array and increment the value of all its associated regions. This is done in O(n).

*Well, if we knew the intervals were densely packed in a small region we could use a counting sort and that would bring us down back to O(n) but this doesn't seem like a good assumption for the general case.

Any way to improve this to O(n)?

Upvotes: 4

Answers (2)

David Eisenstat

Reputation: 65447

Not with a comparison-based algorithm. (Most likely this argument can be extended to algebraic decision trees of fixed degree, but that would make it a lot more technical.)

As with sorting, the key is to observe that, with n! possibilities for the output, we cannot run any faster than lg n! = Theta(n log n) comparisons, each of which yields one bit (assuming a non-degenerate input, which, since we're arguing the lower bound here, is under our control, so not an assumption at all). Here are the encoding/decoding algorithms. (The formal proof is left as an exercise.)

Encoding: input is a1, ..., an such that aj in {1, ..., n + 1 - j}
          output is overlap counts c1, ..., cn of some instance

For j = 1 to n,
    Add an interval (j, j + aj - 1/2)
For j = 1 to n,
    Output the overlap count cj of the interval beginning at j

Decoding: input is overlap counts c1, ..., cn.
          output is a1, ..., an

For j = 1 to n,
    Initialize cj' = cj
For j = 1 to n,
    Set aj = cj' + 1
    For k = j + 1 to j + cj',
        ck' = ck' - 1

With a slight modification, this argument proves the asymptotic optimality of amit's algorithm in the parameter k.

Upvotes: 2

amit

Reputation: 178421

Let each interval i be denoted as s_i,e_i (start_i,end_i).

I can show an algorithm which is O(nlogk), where k is the maximal number of intervals that intersects (s_i,s_{i+1}) - for some i.
While at worst case, k is in O(n), it does improve performance for more sparsed intervals.

We will use a min-heap to store intervals while iterating the list, the min-heap will be sorted according to the end value (e_i).

The idea is to iterate the list by ascending start, and to count the number of intervals that were seen, but the end value was higher than the interval.

Pseudo code (with explanations):

h = new min heap //sorted by end value
h.push (-infinity,infinity) //add a dummy interval for avoiding dealing with empty heap cases
res = 0
for each interval (s_i,e_i) in ascending order of s_i:
        //push out all already "expired" intervals:
        while (heap.min() < s_i):
            heap.pop()
        // at this point, all intervals in the heap:
        //    1. started before s_i
        //    2. finish after s_i
        // thus, each of them is intersecting with current interval.
        res = res + heap.size() - 1 //-1 for removing dummy interval (-inf,inf)
        heap.push(e_i)
return res

Time complexity:

Heap size is at most k at each step (as defined above).
Each interval is pushed once and removed once, each O(logk).
This gives total of O(nlogk).

Correctness:

Claim:

Two intervals (s_i,e_i) and (s_j,e_j) intersects, if and only if:

s_i <= s_j <= e_i   OR s_j <= s_i <= e_j

Proof is simple by checking all possibilities for the 2 intervals (we have 4!/(2!2!)=6 possibilities since s_i<=e_i, s_j<=e_j)

(1) s_i <= e_i <= s_j <= e_j           - no overlap
(2) s_j <= e_j <= s_i <= e_j           - no overlap
(3) s_j <= s_i <= e_i <= e_j           - overlap, and condition meets
(4) s_i <= s_j <= e_j <= e_j           - overlap, and condition meets
(5) s_j <= s_i <= e_j <= e_i           - overlap, and condition meets
(6) s_i <= s_j <= e_i <= e_j           - overlap, and condition meets

Back to the proof:

So, we know that if two intervals intersects, when meeting the second one of them (let it be (s_i,e_i)), the first one (s_j,e_j) is still in the heap since s_i <= e_j, and we add the intersection of (s_i,e_i) with (s_j,e_j) to the count. We know it is also correct insertion because we have already seen s_j, so we know s_j <= e_j <= s_i, and by the above claim - this is indeed an intersecting interval.

In addition, since for each intersecting intervals (s_i,e_i) and (s_j,e_j), we are guaranteed that (s_j,e_j) is still in the heap when processing (s_i,e_i) (from above claim, and since we will never remove it because for each k we already processed: s_k <= s_i <= e_j -> e_j >= s_k), it is guaranteed that the intersection of (s_j,e_j) and (s_i,e_i) will be counted when we add the size of the heap in the second interval traversal.

QED

Small assumption: Not sure this handles duplicates well, should take care of these edge cases by carefully looking at < and <= compares.

Python Code:

intervals = [(0,3.5),(1,2),(1.5,2.5),(2.1,3),(4,5)]
#5 overlaps
def findNumOverlapping(intervals):
    import heapq
    h = []
    heapq.heappush(h, 10000) #infinity
    res = 0
    for (s,e) in intervals:
        while (heapq.nsmallest(1, h)[0] < s):
            heapq.heappop(h)
        res = res + len(h) - 1
        heapq.heappush(h,e)
    return res

Upvotes: 3

Calculating number of overlapping intervals in O(n) time?

Answers (2)

Related Questions