Jase
Jase

Reputation: 1105

How to find interval with most overlap in pandas

Suppose I have four intervals:

a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)

The most common/frequent overlap is between 1 and 1.1, since that's covered by all a, b and c (three of the four above).

How can I find this given a list of N intervals? Ideally I want a function that can take in a list of the 4 above intervals, and return pd.Interval(1, 1.1) as the output.

Upvotes: 1

Views: 489

Answers (1)

MrNobody33
MrNobody33

Reputation: 6483

I supposed you're trying to get the interval that is contained in most other intervals, because the interval with most common overlaps is (-1,1.1] instead of (1,1.1], I don't know if maybe I'am wrong, but using pd.IntervalsIndex.overlaps I got this:

import pandas as pd
from itertools import product

a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)
ls=[a,b,c,d]
indexinter = pd.IntervalIndex(ls)
print(indexinter.overlaps(pd.Interval(1, 1.1)))
print('Interval (1,1.1] overlaps ',sum(indexinter.overlaps(pd.Interval(1, 1.1))),' of 4')
print('\n')
print(indexinter.overlaps(pd.Interval(-1, 1.1)))
print('Interval (-1,1.1] overlaps ',sum(indexinter.overlaps(pd.Interval(-1, 1.1))),' of 4')

Output:

[ True  True  True False]
Interval (1,1.1] overlaps  3  of 4


[ True  True  True  True]
Interval (-1,1.1] overlaps  4  of 4

Nonethless, to get the interval that is contained in most other intervals, you could try this:

First option

import pandas as pd
import portion as P

d = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
a = pd.Interval(-1,-0.5)

def getmaxoverlapwrap(ls,inter):
    intervals=[P.openclosed(i.left, i.right) for i in ls]
    for i in intervals[1:]:
        newinter=inter&i
        if newinter.empty==False:
            inter=newinter     
    return inter
def getmaxoverlap(ls):
    intervals=[P.openclosed(i.left, i.right) for i in ls]
    ls1=[getmaxoverlapwrap(ls,i) for i in intervals]
    index = list(map(lambda x: sum([(x in s) for s in ls]), ls1)).index(max(list(map(lambda x: sum([(x in s) for s in ls]), ls1))))
    return ls1[index]
ls=[a,b,c,d]
indexinter = pd.IntervalIndex(ls)
print(getmaxoverlap(ls))

You have to install library portion (formerly distributed as python-intervals), that is a library that provides us data structure and operations for intervals in Python 3.5+, such as: -Support intervals of any (comparable) objects, -Closed or open, finite or (semi-)infinite intervals, -Interval sets (union of atomic intervals) are supported, -Automatic simplification of intervals, etc.

Second option

import pandas as pd
from itertools import product

a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)

def getmostcontained(ls):
    indexinter = pd.IntervalIndex([pd.Interval(inter.left, inter.right, closed='both') for inter in ls])
    rights = [i.right for i in ls]
    lefts=[i.left for i in ls]
    allcomb = [pd.Interval(i[0],i[1]) for i in list(set(list(product(lefts, rights)))) if i[0]<=i[1]]
    count=0
    maxi=sum(indexinter.contains(allcomb[0].left))+sum(indexinter.contains(allcomb[0].right))
    for i in allcomb:
        newmaxi=sum(indexinter.contains(i.left))+sum(indexinter.contains(i.right))

        if newmaxi>maxi:
            count=allcomb.index(i)

    return allcomb[count]

ls=[a,b,c,d]
print(getmostcontained(ls))

Third option

import pandas as pd

a = pd.Interval(0, 2)
b = pd.Interval(1, 2)
c = pd.Interval(0, 1.1)
d = pd.Interval(-1,-0.5)

def getmaxoverlap(ls):
    indexinter = pd.IntervalIndex(ls)
    start=ls[0].left
    end=ls[0].right
    checkstart=sum(indexinter.contains(start+0.00000000099))
    checkend=sum(indexinter.contains(end))


    for i in ls:
        if sum(indexinter.contains(i.left+0.000000000099))>checkstart:
            checkstart=sum(indexinter.contains(i.left+0.000000000099))
            start=i.left
        if sum(indexinter.contains(i.right))>checkend:
            checkend=sum(indexinter.contains(i.right))
            end=i.right

    return pd.Interval(start, end)

ls=[a,b,c,d]
print(getmostcontained(ls))            

Output of all options:

(1,1.1]

The first option took 0.025967300000047544 seconds to run 100 times the script, and took 27.19270740000138 seconds to run 100000 times the script, he second option took 0.15769210000144085 seconds to run 100 times the script, and took 115.13360900000043 seconds to run 100000 times the script, and third option took 0.08970480000061798 seconds to run 100 times the script, and took 107.52801619999991 seconds to run 100000 times the script. The difference is almost not noticeable maybe, at least for the second and third options, but here are the time stamps I got just so you know and choose one.

Upvotes: 2

Related Questions