Reputation: 2161
I have the following toy example dataframe, df
:
f_low f_high
0.476201 0.481915
0.479161 0.484977
0.485997 0.491911
0.503259 0.508679
0.504687 0.510075
0.504687 0.670075
0.666093 0.670438
0.765602 0.770028
0.766884 0.771307
0.775986 0.780398
0.794590 0.798965
to find overlapping subsets of this, I am using the following code:
df = df.sort_values('f_low')
for row in df.itertuples():
iix = pd.IntervalIndex.from_arrays(df.f_low, df.f_high, closed='neither')
span_range = pd.Interval(row.f_low, row.f_high)
fx = df[(iix.overlaps(span_range))].copy()
I would LIKE to get overlapping dataframes like this:
# iteration 1: over row.f_low=0.476201 row.f_high=0.481915
f_low f_high
0.476201 0.481915
0.479161 0.484977
# iteration 2: over row.f_low=0.503259 row.f_high=0.508679
f_low f_high
0.503259 0.508679
0.504687 0.510075
0.504687 0.670075
# iteration 3: over row.f_low=0.504687 row.f_high=0.670075
f_low f_high
0.666093 0.670438
etc.
This works great, but since the dataframe is quite large and there are a lot of overlaps, this takes a long time to process. Also, the interval I am testing for overlaps does not grab itself when using the Interval
and overlaps
methods for pandas.
What this is meant to represent is a series of overlapping confidence intervals with each row that gets iterated over.
Is there a way to more efficiently extract overlapping intervals against a given interval besides iterating through all the tuples?
Here is a chunk of the actual dataframe unsorted:
f_low f_high
0.504687 0.670075
0.476201 0.481915
0.765602 0.770028
0.479161 0.484977
0.766884 0.771307
0.485997 0.491911
0.666093 0.670438
0.503259 0.508679
0.775986 0.780398
0.504687 0.510075
0.794590 0.798965
Upvotes: 3
Views: 3669
Reputation: 294218
Treat "f_low"
values as entry points and assign a value of 1
. Treat "f_high"
values as exit points and assign a value of -1
. If we proceed through all values in increasing order and accumulate assigned values then we will have an overlapping interval when the accumulated value is greater than zero. We know we've exited any overlapping intervals if the accumulated value reaches zero.
This groups all intervals that continuously overlap. If an interval doesn't overlap with the first BUT does overlap with the last one in the chain, then it counts as overlapping.
I'll provide an analogous solution for the other option below this solution.
# 1 3 (Interval from 1 to 3)
# 2 5 (Interval from 2 to 5)
# 7 9 (Interval from 7 to 9)
# 1 1 -1 -1 1 -1 (Entry/Exit values)
# 1 2 1 0 1 0 (Accumulated values)
# ⇑ ⇑
# zero indicates leaving all overlaps
This indicates that we start once we enter into the interval from 1
to 3
, we don't leave all overlapping intervals until we get to 5
the right side of the interval from 2
to 5
as indicate by the accumulated value reaching zero.
I'll use a generator to return lists of the indices of the original dataframe that have overlapping intervals.
When all is said and done, this should be N * Log(N)
for the sorting involved.
def gen_overlaps(df):
df = df.sort_values('f_low')
# get sorter lows and highs
a = df.to_numpy().ravel().argsort()
# get free un-sorter
b = np.empty_like(a)
b[a] = np.arange(len(a))
# get ones and negative ones
# to indicate entering into
# and exiting an interval
c = np.ones(df.shape, int) * [1, -1]
# if we sort by all values and
# accumulate when we enter and exit
# the accumulated value should be
# zero when there are no overlaps
d = c.ravel()[a].cumsum()[b].reshape(df.shape)
# ⇑ ⇑
# sort by value order unsort to get back to original order
indices = []
for i, indicator in zip(df.index, d[:, 1] == 0):
indices.append(i)
if indicator:
yield indices
indices = []
if indices:
yield indices
Then I'll use pd.concat
to organize them to show what I mean. k
is the kth
group. Some groups only have one interval.
pd.concat({
k: df.loc[i] for k, i in
enumerate(gen_overlaps(df))
})
f_low f_high
0 0 0.476201 0.481915
1 0.479161 0.484977
1 2 0.485997 0.491911
2 3 0.503259 0.508679
4 0.504687 0.510075
5 0.504687 0.670075
6 0.666093 0.670438
3 7 0.765602 0.770028
8 0.766884 0.771307
4 9 0.775986 0.780398
5 10 0.794590 0.798965
If we only wanted the ones that overlapped...
pd.concat({
k: df.loc[i] for k, i in
enumerate(gen_overlaps(df))
if len(i) > 1
})
f_low f_high
0 0 0.476201 0.481915
1 0.479161 0.484977
2 3 0.503259 0.508679
4 0.504687 0.510075
5 0.504687 0.670075
6 0.666093 0.670438
3 7 0.765602 0.770028
8 0.766884 0.771307
This is a simpler solution and matches OPs desired output.
def gen_overlaps(df):
df = df.sort_values('f_low')
indices = []
cursor = None
for i, low, high in df.itertuples():
if not indices:
cursor = high
if low <= cursor:
indices.append(i)
else:
yield indices
indices = []
cursor = high
if len(indices) > 1:
yield indices
pd.concat({
k: df.loc[i] for k, i in
enumerate(gen_overlaps(df))
})
f_low f_high
0 0 0.476201 0.481915
1 0.479161 0.484977
1 3 0.503259 0.508679
4 0.504687 0.510075
5 0.504687 0.670075
2 7 0.765602 0.770028
8 0.766884 0.771307
Upvotes: 4
Reputation: 191
I'm not sure what kind of overlapping do you need, but I think this approach can work for it:
query
must be better than .loc
import pandas as pd
df = pd.DataFrame(
[
[0.504687, 0.670075],
[0.476201, 0.481915],
[0.765602, 0.770028],
[0.479161, 0.484977],
[0.766884, 0.771307],
[0.485997, 0.491911],
[0.666093, 0.670438],
[0.503259, 0.508679],
[0.775986, 0.780398],
[0.504687, 0.510075],
[0.794590, 0.798965]
],
columns=["f_low", "f_high"]
)
overlap = {
(row.f_low, row.f_high): df.query("(@row.f_low <= f_low <= @row.f_high) or (@row.f_low <= f_high <= @row.f_high)")
for row in df.itertuples()
}
Upvotes: 0
Reputation: 93151
Using numpy's array broadcasting:
l1 = df['f_low'].to_numpy()
h1 = df['f_high'].to_numpy()
l2 = l1[:, None]
h2 = h1[:, None]
# Check for overlap
# mask is an n * n matrix indicating if interval i overlaps with interval j
mask = (l1 < h2) & (h1 > l2)
# If interval i overlaps intervla j then j also overlaps i. We only want to get
# one of the two pairs. Hence the `triu` (triangle, upper)
# Every interval also overlaps itself and we don't want that either. Hence the k=1
overlaps = np.triu(mask, k=1).nonzero()
The result in overlaps
requires some interpretations:
(array([0, 3, 3, 4, 5, 7]),
array([1, 4, 5, 5, 6, 8]))
# Row 0 overlaps with row 1
# Row 3 overlaps with row 4
# Row 3 overlaps with row 5
# ....
Upvotes: 1
Reputation: 11161
If I understand correctly, you want to separate your current df into data frames where the initial interval is set by the first row, and the second interval is defined by the first row that does not intersect, etc. The below method will do that and should be pretty efficient if the number of groups isn't too large:
df = df.sort_values("f_low").reset_index(drop=True)
idx = 0
dfs = []
while True:
low = df.f_low[idx]
high = df.f_high[idx]
sub_df = df[(df.f_low <= high) & (low <= df.f_low)]
dfs.append(sub_df)
idx = sub_df.index.max() + 1
if idx > df.index.max():
break
output:
[ f_low f_high
0 0.476201 0.481915
1 0.479161 0.484977,
f_low f_high
2 0.485997 0.491911,
f_low f_high
3 0.503259 0.508679
4 0.504687 0.510075
5 0.504687 0.670075,
f_low f_high
6 0.666093 0.670438,
f_low f_high
7 0.765602 0.770028
8 0.766884 0.771307,
f_low f_high
9 0.775986 0.780398,
f_low f_high
10 0.79459 0.798965]
Upvotes: 2
Reputation: 13750
Does this work?
intervals = df.apply(lambda row: pd.Interval(row['f_low'], row['f_high']), axis=1)
overlaps = [
(i, j, x, y, x.overlaps(y))
for ((i,x),(j,y))
in itertools.product(enumerate(intervals), repeat=2)
]
>>> overlaps[:3]
[(0,
0,
Interval(0.47620100000000004, 0.481915, closed='right'),
Interval(0.47620100000000004, 0.481915, closed='right'),
True),
(0,
1,
Interval(0.47620100000000004, 0.481915, closed='right'),
Interval(0.47916099999999995, 0.48497700000000005, closed='right'),
True),
(0,
2,
Interval(0.47620100000000004, 0.481915, closed='right'),
Interval(0.485997, 0.491911, closed='right'),
False)]
From this you can get the numeric indices in the original DataFrame. Not sure how performant it is, but it should be better than what you've got now.
Upvotes: 1