horcle_buzz

Reputation: 2161

How to efficiently find overlapping intervals?

I have the following toy example dataframe, df:

      f_low    f_high
   0.476201  0.481915
   0.479161  0.484977
   0.485997  0.491911
   0.503259  0.508679
   0.504687  0.510075
   0.504687  0.670075
   0.666093  0.670438
   0.765602  0.770028
   0.766884  0.771307
   0.775986  0.780398
   0.794590  0.798965

To find overlapping subsets of this, I am using the following code:

df = df.sort_values('f_low')
for row in df.itertuples():
    iix = pd.IntervalIndex.from_arrays(df.f_low, df.f_high, closed='neither')
    span_range = pd.Interval(row.f_low, row.f_high)
    fx = df[(iix.overlaps(span_range))].copy()

I would LIKE to get overlapping dataframes like this:

   # iteration 1: over row.f_low=0.476201  row.f_high=0.481915 

      f_low    f_high
   0.476201  0.481915
   0.479161  0.484977

   # iteration 2: over row.f_low=0.503259  row.f_high=0.508679 
      f_low    f_high
   0.503259  0.508679 
   0.504687  0.510075
   0.504687  0.670075

   # iteration 3: over row.f_low=0.504687  row.f_high=0.670075 
      f_low    f_high
   0.666093  0.670438

etc.

This works, but since the dataframe is quite large and there are a lot of overlaps, it takes a long time to process. Also, the interval I am testing does not pick itself up when using pandas' Interval and overlaps methods.

What this is meant to represent is the set of confidence intervals that overlap each row as it is iterated over.

Is there a way to more efficiently extract overlapping intervals against a given interval besides iterating through all the tuples?

Here is a chunk of the actual dataframe unsorted:

f_low   f_high
0.504687  0.670075
0.476201  0.481915
0.765602  0.770028
0.479161  0.484977
0.766884  0.771307
0.485997  0.491911
0.666093  0.670438
0.503259  0.508679
0.775986  0.780398
0.504687  0.510075
0.794590  0.798965

Upvotes: 3

Views: 3669

Answers (5)

piRSquared

Reputation: 294218

Continuous Overlap

Treat "f_low" values as entry points and assign a value of 1. Treat "f_high" values as exit points and assign a value of -1. If we proceed through all values in increasing order and accumulate assigned values then we will have an overlapping interval when the accumulated value is greater than zero. We know we've exited any overlapping intervals if the accumulated value reaches zero.

NOTE:

This groups all intervals that continuously overlap. If an interval doesn't overlap with the first BUT does overlap with the last one in the chain, then it counts as overlapping.

I'll provide an analogous solution for the other option below this solution.


Attempted Example

#  1     3                     (Interval from 1 to 3)
#     2        5               (Interval from 2 to 5)
#                    7     9   (Interval from 7 to 9)

#  1  1 -1    -1     1    -1   (Entry/Exit values)
#  1  2  1     0     1     0   (Accumulated values)
#              ⇑           ⇑
# zero indicates leaving all overlaps

This indicates that we start once we enter the interval from 1 to 3, and we don't leave all overlapping intervals until we reach 5, the right side of the interval from 2 to 5, as indicated by the accumulated value reaching zero.
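The diagram above can be reproduced directly with NumPy (a minimal sketch using just the three intervals from the diagram):

```python
import numpy as np

# the three intervals from the diagram: (1, 3), (2, 5), (7, 9)
bounds = np.array([[1, 3], [2, 5], [7, 9]])

# +1 when entering at a low, -1 when exiting at a high
deltas = np.tile([1, -1], len(bounds))

# walk all endpoints in increasing value order, accumulating entries/exits
order = bounds.ravel().argsort()
depth = deltas[order].cumsum()

print(depth)  # [1 2 1 0 1 0] -- the zeros mark leaving all overlaps
```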


I'll use a generator to return lists of the indices of the original dataframe that have overlapping intervals.

When all is said and done, this should be O(N log N) because of the sorting involved.

import numpy as np
import pandas as pd

def gen_overlaps(df):
    df = df.sort_values('f_low')
    
    # get sorter lows and highs
    a = df.to_numpy().ravel().argsort()
    
    # get free un-sorter
    b = np.empty_like(a)
    b[a] = np.arange(len(a))
    
    # get ones and negative ones
    # to indicate entering into
    # and exiting an interval
    c = np.ones(df.shape, int) * [1, -1]
    
    # if we sort by all values and
    # accumulate when we enter and exit
    # the accumulated value should be 
    # zero when there are no overlaps
    d = c.ravel()[a].cumsum()[b].reshape(df.shape)
    #             ⇑           ⇑
    # sort by value order     unsort to get back to original order
    
    indices = []
    for i, indicator in zip(df.index, d[:, 1] == 0):
        indices.append(i)
        if indicator:
            yield indices
            indices = []
    if indices:
        yield indices
    

Then I'll use pd.concat to organize them to show what I mean. k is the kth group. Some groups only have one interval.

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})

         f_low    f_high
0 0   0.476201  0.481915
  1   0.479161  0.484977
1 2   0.485997  0.491911
2 3   0.503259  0.508679
  4   0.504687  0.510075
  5   0.504687  0.670075
  6   0.666093  0.670438
3 7   0.765602  0.770028
  8   0.766884  0.771307
4 9   0.775986  0.780398
5 10  0.794590  0.798965

If we only wanted the ones that overlapped...

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
    if len(i) > 1
})

        f_low    f_high
0 0  0.476201  0.481915
  1  0.479161  0.484977
2 3  0.503259  0.508679
  4  0.504687  0.510075
  5  0.504687  0.670075
  6  0.666093  0.670438
3 7  0.765602  0.770028
  8  0.766884  0.771307

Only Overlap Next Interval in Queue

This is a simpler solution and matches the OP's desired output.

def gen_overlaps(df):
    df = df.sort_values('f_low')

    indices = []
    cursor = None
    for i, low, high in df.itertuples():
        if not indices:
            cursor = high
        if low <= cursor:
            indices.append(i)
        else:
            # this row starts the next group; only yield
            # groups that actually contain an overlap
            if len(indices) > 1:
                yield indices
            indices = [i]
            cursor = high
    if len(indices) > 1:
        yield indices
    

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})

        f_low    f_high
0 0  0.476201  0.481915
  1  0.479161  0.484977
1 3  0.503259  0.508679
  4  0.504687  0.510075
  5  0.504687  0.670075
2 7  0.765602  0.770028
  8  0.766884  0.771307

Upvotes: 4

Alonso Ogueda Oliva

Reputation: 191

I'm not sure what kind of overlapping you need, but I think this approach can work for it:

  • Make sure your masking is adequate.
  • Create a dictionary whose keys are the f_low and f_high from every iteration.
  • Filter the original dataframe.
  • As you said, the real use case involves large datasets, so query should perform better than .loc.

import pandas as pd
df = pd.DataFrame(
    [
        [0.504687, 0.670075],
        [0.476201, 0.481915],
        [0.765602, 0.770028],
        [0.479161, 0.484977],
        [0.766884, 0.771307],
        [0.485997, 0.491911],
        [0.666093, 0.670438],
        [0.503259, 0.508679],
        [0.775986, 0.780398],
        [0.504687, 0.510075],
        [0.794590, 0.798965]
    ],
    columns=["f_low", "f_high"]
)
overlap = {
    (row.f_low, row.f_high): df.query("(@row.f_low <= f_low <= @row.f_high) or (@row.f_low <= f_high <= @row.f_high)")
    for row in df.itertuples()
}
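For instance, running the same query expression against the first interval of a trimmed-down frame picks up both rows that touch it (a sketch, not the full dataset):

```python
import pandas as pd

df = pd.DataFrame(
    [
        [0.476201, 0.481915],
        [0.479161, 0.484977],
        [0.485997, 0.491911],
    ],
    columns=["f_low", "f_high"],
)

row = next(df.itertuples())  # the (0.476201, 0.481915) interval
fx = df.query(
    "(@row.f_low <= f_low <= @row.f_high) or (@row.f_low <= f_high <= @row.f_high)"
)
print(fx)  # rows 0 and 1
```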

Upvotes: 0

Code Different

Reputation: 93151

Using numpy's array broadcasting:

l1 = df['f_low'].to_numpy()
h1 = df['f_high'].to_numpy()

l2 = l1[:, None]
h2 = h1[:, None]

# Check for overlap
# mask is an n * n matrix indicating if interval i overlaps with interval j
mask = (l1 < h2) & (h1 > l2)

# If interval i overlaps interval j, then j also overlaps i. We only want to get
# one of the two pairs, hence the `triu` (triangle, upper).
# Every interval also overlaps itself, and we don't want that either, hence k=1.
overlaps = np.triu(mask, k=1).nonzero()

The result in overlaps requires some interpretation:

(array([0, 3, 3, 4, 5, 7]),
 array([1, 4, 5, 5, 6, 8]))

# Row 0 overlaps with row 1
# Row 3 overlaps with row 4
# Row 3 overlaps with row 5
# ....
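Row i of mask can also be used directly to pull the rows overlapping interval i, much like the OP's per-row loop (a sketch on a trimmed-down frame; note that this indexing keeps the interval itself):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "f_low":  [0.476201, 0.479161, 0.485997, 0.503259],
    "f_high": [0.481915, 0.484977, 0.491911, 0.508679],
})

l1 = df["f_low"].to_numpy()
h1 = df["f_high"].to_numpy()

# mask[i, j] is True when interval j overlaps interval i
mask = (l1 < h1[:, None]) & (h1 > l1[:, None])

print(df[mask[0]])  # rows 0 and 1 overlap interval 0
```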

Upvotes: 1

anon01

Reputation: 11161

If I understand correctly, you want to separate your current df into data frames where the initial interval is set by the first row, the second interval is defined by the first row that does not intersect it, and so on. The method below does that and should be pretty efficient if the number of groups isn't too large:

df = df.sort_values("f_low").reset_index(drop=True)
idx = 0
dfs = []
while True:
    low = df.f_low[idx]
    high = df.f_high[idx]
    sub_df = df[(df.f_low <= high) & (low <= df.f_low)]
    dfs.append(sub_df)
    idx = sub_df.index.max() + 1
    if idx > df.index.max():
        break

output:

[      f_low    f_high
 0  0.476201  0.481915
 1  0.479161  0.484977,
       f_low    f_high
 2  0.485997  0.491911,
       f_low    f_high
 3  0.503259  0.508679
 4  0.504687  0.510075
 5  0.504687  0.670075,
       f_low    f_high
 6  0.666093  0.670438,
       f_low    f_high
 7  0.765602  0.770028
 8  0.766884  0.771307,
       f_low    f_high
 9  0.775986  0.780398,
       f_low    f_high
 10  0.79459  0.798965]

Upvotes: 2

BallpointBen

Reputation: 13750

Does this work?

import itertools

intervals = df.apply(lambda row: pd.Interval(row['f_low'], row['f_high']), axis=1)
overlaps = [
    (i, j, x, y, x.overlaps(y)) 
    for ((i,x),(j,y))
    in itertools.product(enumerate(intervals), repeat=2)
]

>>> overlaps[:3]
[(0,
  0,
  Interval(0.47620100000000004, 0.481915, closed='right'),
  Interval(0.47620100000000004, 0.481915, closed='right'),
  True),
 (0,
  1,
  Interval(0.47620100000000004, 0.481915, closed='right'),
  Interval(0.47916099999999995, 0.48497700000000005, closed='right'),
  True),
 (0,
  2,
  Interval(0.47620100000000004, 0.481915, closed='right'),
  Interval(0.485997, 0.491911, closed='right'),
  False)]

From this you can get the numeric indices in the original DataFrame. Not sure how performant it is, but it should be better than what you've got now.
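Since overlaps is symmetric and every interval overlaps itself, itertools.combinations yields each pair once with no self-pairs, halving the work compared to product; a sketch on a trimmed-down frame:

```python
import itertools

import pandas as pd

df = pd.DataFrame(
    [
        [0.476201, 0.481915],
        [0.479161, 0.484977],
        [0.485997, 0.491911],
    ],
    columns=["f_low", "f_high"],
)

intervals = df.apply(lambda row: pd.Interval(row["f_low"], row["f_high"]), axis=1)

# each unordered pair (i, j) with i < j, kept only if the intervals overlap
pairs = [
    (i, j)
    for (i, x), (j, y) in itertools.combinations(enumerate(intervals), 2)
    if x.overlaps(y)
]
print(pairs)  # [(0, 1)] -- only the first two rows overlap
```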

Upvotes: 1
