Reputation: 199
I have a list of ranges such as:
[12-48,40-80,60-105,110-130,75-400]
And I need to filter out or remove the ranges which overlap more than x digits (So overlap more than 10 for example) and/or overlap more than x% (Lets say 20%) of the smallest of the compared ranges.
At the moment I use a for loop to check each range at a time and compare it to the next to see whether they overlap past my stated limits, and if so, remove it. This does not work as with in the example I show I get this result:
[12-48,75-400]
The range [40-80] should not have been removed because it does not overlap with our 2 remaining ranges past the limits but because it overlapped [60-105] and was the smaller of the 2, it was removed. The correct remaining ranges should be:
[12-48,40-80,75-400]
I do not think a simple for loop is the solution here but I am at a loss. Please let me know if anything is unclear.
The parts with GeneA/GenePrev/GeneAND are how I calculate the % overlap and can be ignored.
start = int(key.split(',')[0])
stop = int(key.split(',')[1])
length = stop - start
if First == True:
Both_Frames[key] = value
First = False
GeneA[start:stop] = [1] * (stop - start)
GenePrev = GeneA
PrevStart = start
PrevStop = stop
prevlength = PrevStop - PrevStart
else:
GeneA[start:stop] = [1] * (stop - start)
Gene_AND = GenePrev & GeneA
if start == PrevStart:
GenePrev = GeneA
######Need to delete item from dictionary which is overlapping
Both_Frames.popitem(last=False)
Both_Frames[key] = value
PrevStart = start
PrevStop = stop
prevlength = PrevStop - PrevStart
elif start >= PrevStart and stop <= PrevStop:
continue
elif np.count_nonzero(Gene_AND) <= (length * OverLapPercentage) and np.count_nonzero(Gene_AND) <= OverLapNT:
GenePrev = GeneA
Both_Frames[key] = value
PrevStart = start
PrevStop = stop
prevlength = PrevStop - PrevStart
elif np.count_nonzero(Gene_AND) >= (length * OverLapPercentage) or np.count_nonzero(Gene_AND) >= OverLapNT:
if length > prevlength:
GenePrev = GeneA
Both_Frames.popitem(last=False)
Both_Frames[key] = value
PrevStart = start
PrevStop = stop
prevlength = PrevStop - PrevStart
Upvotes: 1
Views: 214
Reputation: 7045
I may have a convoluted solution for you:
First, I convert your ranges to a list
of tuples
of int
:
import pandas as pd
r = ["12-48", "40-80", "60-105", "110-130", "75-400"]
r = [tuple(map(int, z.split("-"))) for z in r]
# [(12, 48), (40, 80), (60, 105), (110, 130), (75, 400)]
Then, I iterate all the ranges and remove any that are entirely encapsulated by another range. E.g: (110, 130)
is within (75, 400)
:
hold = []
for idx1 in range(len(r)):
start_1, stop_1 = r[idx1]
for idx2, (start_2, stop_2) in enumerate(r):
if idx1 == idx2:
continue
if start_2 < start_1 and stop_1 < stop_2:
hold.append(idx1)
while hold:
del r[hold.pop()]
# [(12, 48), (40, 80), (60, 105), (75, 400)]
Finally, using a pandas.DataFrame
I calculate the overlap and percentage overlap; label the rows that meet your exclusion criteria (overlap > 10 and % > 0.2). These rows are then removed in reverse order, and the overlaps are tested again after each removal until no more rows can be removed.
The DataFrame is then converted back into a list of strings in the same format they were supplied in.
df = pd.DataFrame(r, columns=["start", "stop"]).sort_values("start")
df["length"] = df["stop"] - df["start"]
df["bool_1"], df["bool_2"] = True, True
while any(df["bool_1"].eq(True) & df["bool_2"].eq(True)):
df["overlap"] = df["stop"] - df["start"].shift(-1)
df["pc"] = df["overlap"] / df["length"]
df["bool_1"] = df["overlap"] > 10
df["bool_2"] = df["pc"] > 0.2
for i, row in df.sort_index(ascending=False).iterrows():
if row["bool_1"] == row["bool_2"] and row["bool_1"] is not False:
df.drop(i, inplace=True)
break
result = df["start"].astype("str").str.cat(df["stop"].astype("str"), sep="-").to_list()
# ['12-48', '40-80', '75-400']
Upvotes: 1