Reputation: 1105
Suppose I have four intervals:
a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)
The most common/frequent overlap is between 1 and 1.1, since that's covered by all a
, b
and c
(three of the four above).
How can I find this given a list of N intervals? Ideally I want a function that can take in a list of the 4 above intervals, and return pd.Interval(1, 1.1)
as the output.
Upvotes: 1
Views: 489
Reputation: 6483
I supposed you're trying to get the interval that is contained in most other intervals, because the interval with most common overlaps is (-1,1.1]
instead of (1,1.1]
, I don't know if maybe
I'am wrong, but using pd.IntervalsIndex.overlaps I got this:
import pandas as pd
from itertools import product
a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)
ls=[a,b,c,d]
indexinter = pd.IntervalIndex(ls)
print(indexinter.overlaps(pd.Interval(1, 1.1)))
print('Interval (1,1.1] overlaps ',sum(indexinter.overlaps(pd.Interval(1, 1.1))),' of 4')
print('\n')
print(indexinter.overlaps(pd.Interval(-1, 1.1)))
print('Interval (-1,1.1] overlaps ',sum(indexinter.overlaps(pd.Interval(-1, 1.1))),' of 4')
Output:
[ True True True False]
Interval (1,1.1] overlaps 3 of 4
[ True True True True]
Interval (-1,1.1] overlaps 4 of 4
Nonethless, to get the interval that is contained in most other intervals, you could try this:
import pandas as pd
import portion as P
d = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
a = pd.Interval(-1,-0.5)
def getmaxoverlapwrap(ls,inter):
intervals=[P.openclosed(i.left, i.right) for i in ls]
for i in intervals[1:]:
newinter=inter&i
if newinter.empty==False:
inter=newinter
return inter
def getmaxoverlap(ls):
intervals=[P.openclosed(i.left, i.right) for i in ls]
ls1=[getmaxoverlapwrap(ls,i) for i in intervals]
index = list(map(lambda x: sum([(x in s) for s in ls]), ls1)).index(max(list(map(lambda x: sum([(x in s) for s in ls]), ls1))))
return ls1[index]
ls=[a,b,c,d]
indexinter = pd.IntervalIndex(ls)
print(getmaxoverlap(ls))
You have to install library portion
(formerly distributed as python-intervals), that is a library that provides us data structure and operations for intervals in Python 3.5+, such as: -Support intervals of any (comparable) objects, -Closed or open, finite or (semi-)infinite intervals, -Interval sets (union of atomic intervals) are supported, -Automatic simplification of intervals, etc.
import pandas as pd
from itertools import product
a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)
def getmostcontained(ls):
indexinter = pd.IntervalIndex([pd.Interval(inter.left, inter.right, closed='both') for inter in ls])
rights = [i.right for i in ls]
lefts=[i.left for i in ls]
allcomb = [pd.Interval(i[0],i[1]) for i in list(set(list(product(lefts, rights)))) if i[0]<=i[1]]
count=0
maxi=sum(indexinter.contains(allcomb[0].left))+sum(indexinter.contains(allcomb[0].right))
for i in allcomb:
newmaxi=sum(indexinter.contains(i.left))+sum(indexinter.contains(i.right))
if newmaxi>maxi:
count=allcomb.index(i)
return allcomb[count]
ls=[a,b,c,d]
print(getmostcontained(ls))
import pandas as pd
a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)
def getmaxoverlap(ls):
indexinter = pd.IntervalIndex(ls)
start=ls[0].left
end=ls[0].right
checkstart=sum(indexinter.contains(start+0.00000000099))
checkend=sum(indexinter.contains(end))
for i in ls:
if sum(indexinter.contains(i.left+0.000000000099))>checkstart:
checkstart=sum(indexinter.contains(i.left+0.000000000099))
start=i.left
if sum(indexinter.contains(i.right))>checkend:
checkend=sum(indexinter.contains(i.right))
end=i.right
return pd.Interval(start, end)
ls=[a,b,c,d]
print(getmostcontained(ls))
Output of all options:
(1,1.1]
The first option took 0.025967300000047544 seconds
to run 100 times the script, and took 27.19270740000138 seconds
to run 100000 times the script, he second option took 0.15769210000144085 seconds
to run 100 times the script, and took 115.13360900000043 seconds
to run 100000 times the script, and third option took 0.08970480000061798 seconds
to run 100 times the script, and took 107.52801619999991 seconds
to run 100000 times the script. The difference is almost not noticeable maybe, at least for the second and third options, but here are the time stamps I got just so you know and choose one.
Upvotes: 2