Reputation: 105037
Consider a list of intervals of the form (start, end)
. The intervals are already sorted in the list by their start
component.
My question is if there's a way to calculate for each different interval, how many intervals will overlap with it, in O(n)
time.
I envision several variants working in O(n lg n)
time, but I was curious regarding the O(n)
constraint.
For instance, an O(n lg n)
solution would be:
start
and end
values to an array A and sort the array (so we're now in the O(n lg n)
realm*), eliminating any duplicates in the process.A
create an array R
of regions (N-1 regions with R[i] = (A[i], A[i+1])
).O(n)
.*Well, if we knew the intervals were densely packed in a small region we could use a counting sort and that would bring us down back to O(n)
but this doesn't seem like a good assumption for the general case.
Any way to improve this to O(n)
?
Upvotes: 4
Views: 1051
Reputation: 65447
Not with a comparison-based algorithm. (Most likely this argument can be extended to algebraic decision trees of fixed degree, but that would make it a lot more technical.)
As with sorting, the key is to observe that, with n! possibilities for the output, we cannot run any faster than lg n! = Theta(n log n) comparisons, each of which yields one bit (assuming a non-degenerate input, which, since we're arguing the lower bound here, is under our control, so not an assumption at all). Here are the encoding/decoding algorithms. (The formal proof is left as an exercise.)
Encoding: input is a1, ..., an such that aj in {1, ..., n + 1 - j}
output is overlap counts c1, ..., cn of some instance
For j = 1 to n,
Add an interval (j, j + aj - 1/2)
For j = 1 to n,
Output the overlap count cj of the interval beginning at j
Decoding: input is overlap counts c1, ..., cn.
output is a1, ..., an
For j = 1 to n,
Initialize cj' = cj
For j = 1 to n,
Set aj = cj' + 1
For k = j + 1 to j + cj',
ck' = ck' - 1
With a slight modification, this argument proves the asymptotic optimality of amit's algorithm in the parameter k.
Upvotes: 2
Reputation: 178421
Let each interval i
be denoted as s_i,e_i
(start_i,end_i).
I can show an algorithm which is O(nlogk)
, where k
is the maximal number of intervals that intersects (s_i,s_{i+1})
- for some i
.
While at worst case, k
is in O(n)
, it does improve performance for more sparsed intervals.
We will use a min-heap to store intervals while iterating the list, the min-heap will be sorted according to the end value (e_i
).
The idea is to iterate the list by ascending start, and to count the number of intervals that were seen, but the end value was higher than the interval.
Pseudo code (with explanations):
h = new min heap //sorted by end value
h.push (-infinity,infinity) //add a dummy interval for avoiding dealing with empty heap cases
res = 0
for each interval (s_i,e_i) in ascending order of s_i:
//push out all already "expired" intervals:
while (heap.min() < s_i):
heap.pop()
// at this point, all intervals in the heap:
// 1. started before s_i
// 2. finish after s_i
// thus, each of them is intersecting with current interval.
res = res + heap.size() - 1 //-1 for removing dummy interval (-inf,inf)
heap.push(e_i)
return res
Time complexity:
k
at each step (as defined above).O(nlogk)
.Correctness:
Claim:
Two intervals (s_i,e_i) and (s_j,e_j) intersects, if and only if:
s_i <= s_j <= e_i OR s_j <= s_i <= e_j
Proof is simple by checking all possibilities for the 2 intervals (we have 4!/(2!2!)=6
possibilities since s_i<=e_i, s_j<=e_j
)
(1) s_i <= e_i <= s_j <= e_j - no overlap
(2) s_j <= e_j <= s_i <= e_j - no overlap
(3) s_j <= s_i <= e_i <= e_j - overlap, and condition meets
(4) s_i <= s_j <= e_j <= e_j - overlap, and condition meets
(5) s_j <= s_i <= e_j <= e_i - overlap, and condition meets
(6) s_i <= s_j <= e_i <= e_j - overlap, and condition meets
Back to the proof:
So, we know that if two intervals intersects, when meeting the second one of them (let it be (s_i,e_i)
), the first one (s_j,e_j)
is still in the heap since s_i <= e_j
, and we add the intersection of (s_i,e_i)
with (s_j,e_j)
to the count. We know it is also correct insertion because we have already seen s_j
, so we know s_j <= e_j <= s_i
, and by the above claim - this is indeed an intersecting interval.
In addition, since for each intersecting intervals (s_i,e_i)
and (s_j,e_j)
, we are guaranteed that (s_j,e_j)
is still in the heap when processing (s_i,e_i)
(from above claim, and since we will never remove it because for each k
we already processed: s_k <= s_i <= e_j -> e_j >= s_k
), it is guaranteed that the intersection of (s_j,e_j)
and (s_i,e_i)
will be counted when we add the size of the heap in the second interval traversal.
QED
Small assumption: Not sure this handles duplicates well, should take care of these edge cases by carefully looking at <
and <=
compares.
Python Code:
intervals = [(0,3.5),(1,2),(1.5,2.5),(2.1,3),(4,5)]
#5 overlaps
def findNumOverlapping(intervals):
import heapq
h = []
heapq.heappush(h, 10000) #infinity
res = 0
for (s,e) in intervals:
while (heapq.nsmallest(1, h)[0] < s):
heapq.heappop(h)
res = res + len(h) - 1
heapq.heappush(h,e)
return res
Upvotes: 3