NJD
NJD

Reputation: 199

Filter a list of integer ranges by both a maximum overlap percentage and value

I have a list of ranges such as:

[12-48,40-80,60-105,110-130,75-400]

And I need to filter out or remove the ranges which overlap more than x digits (So overlap more than 10 for example) and/or overlap more than x% (Lets say 20%) of the smallest of the compared ranges.

At the moment I use a for loop to check each range at a time and compare it to the next to see whether they overlap past my stated limits, and if so, remove it. This does not work as with in the example I show I get this result:

[12-48,75-400]

The range [40-80] should not have been removed because it does not overlap with our 2 remaining ranges past the limits but because it overlapped [60-105] and was the smaller of the 2, it was removed. The correct remaining ranges should be:

[12-48,40-80,75-400]

I do not think a simple for loop is the solution here but I am at a loss. Please let me know if anything is unclear.

Current Code

The parts with GeneA/GenePrev/GeneAND are how I calculate the % overlap and can be ignored.

        start = int(key.split(',')[0])
        stop = int(key.split(',')[1])
        length = stop - start
        if First == True:
            Both_Frames[key] = value
            First = False
            GeneA[start:stop] = [1] * (stop - start)
            GenePrev = GeneA
            PrevStart = start
            PrevStop = stop
            prevlength = PrevStop - PrevStart
        else:
            GeneA[start:stop] = [1] * (stop - start)
            Gene_AND = GenePrev & GeneA

            if start == PrevStart:
                GenePrev = GeneA
                
                ######Need to delete item from dictionary which is overlapping
                Both_Frames.popitem(last=False)
                Both_Frames[key] = value
                PrevStart = start
                PrevStop = stop
                prevlength = PrevStop - PrevStart
            elif start >= PrevStart and stop <= PrevStop:
           
                continue
            elif  np.count_nonzero(Gene_AND) <= (length * OverLapPercentage) and np.count_nonzero(Gene_AND) <= OverLapNT:
                GenePrev = GeneA
                Both_Frames[key] = value
                PrevStart = start
                PrevStop = stop
                prevlength = PrevStop - PrevStart

            elif np.count_nonzero(Gene_AND) >= (length * OverLapPercentage) or np.count_nonzero(Gene_AND) >= OverLapNT:
                if length > prevlength:
                    GenePrev = GeneA
                 
                    Both_Frames.popitem(last=False)
                    Both_Frames[key] = value
                    PrevStart = start
                    PrevStop = stop
                    prevlength = PrevStop - PrevStart

Upvotes: 1

Views: 214

Answers (1)

Alex
Alex

Reputation: 7045

I may have a convoluted solution for you:

First, I convert your ranges to a list of tuples of int:

import pandas as pd


r = ["12-48", "40-80", "60-105", "110-130", "75-400"]
r = [tuple(map(int, z.split("-"))) for z in r]

# [(12, 48), (40, 80), (60, 105), (110, 130), (75, 400)]

Then, I iterate all the ranges and remove any that are entirely encapsulated by another range. E.g: (110, 130) is within (75, 400):

hold = []
for idx1 in range(len(r)):
    start_1, stop_1 = r[idx1]
    for idx2, (start_2, stop_2) in enumerate(r):
        if idx1 == idx2:
            continue
        if start_2 < start_1 and stop_1 < stop_2:
            hold.append(idx1)

while hold:
    del r[hold.pop()]

# [(12, 48), (40, 80), (60, 105), (75, 400)]

Finally, using a pandas.DataFrame I calculate the overlap and percentage overlap; label the rows that meet your exclusion criteria (overlap > 10 and % > 0.2). These rows are then removed in reverse order, and the overlaps are tested again after each removal until no more rows can be removed.

The DataFrame is then converted back into a list of strings in the same format they were supplied in.

df = pd.DataFrame(r, columns=["start", "stop"]).sort_values("start")

df["length"] = df["stop"] - df["start"]
df["bool_1"], df["bool_2"] = True, True

while any(df["bool_1"].eq(True) & df["bool_2"].eq(True)):
    df["overlap"] = df["stop"] - df["start"].shift(-1)
    df["pc"] = df["overlap"] / df["length"]

    df["bool_1"] = df["overlap"] > 10
    df["bool_2"] = df["pc"] > 0.2
    for i, row in df.sort_index(ascending=False).iterrows():
        if row["bool_1"] == row["bool_2"] and row["bool_1"] is not False:
            df.drop(i, inplace=True)
            break

result = df["start"].astype("str").str.cat(df["stop"].astype("str"), sep="-").to_list()

# ['12-48', '40-80', '75-400']

Upvotes: 1

Related Questions