tadeufontes
tadeufontes

Reputation: 477

How to search for patterns in a list of strings without 100% similarity?

So I have a list of strings:

input_list=["ACTGATCTTATCGAGTCAGCTAGTCGATCGATCGACGCGCGATCGTGATG","TGCATCGATCGATGCTAGTCGATATACGCGATATGTACG","CATCGGATCGATCGATCAGCTCATAGTCAGTC","CATCGATCATATATCGAGCGACAGTCAGTCGATCAGTCATCAGGTAGC","CATCATATCGAGCAGTCATCGTAGTCATGATCGATCGAT","ACATGAATCGATCGATAATCGATCGCGATTCGATG"]

And a list of tuples containing patterns to be looked for in the strings.

list_patterns=[("AGTG","TCGC"),("TATC","ATGT"),("GCAT","TCAG")]

I have this function that, for each string, looks for (any) one of the pairs of patterns from "list_patterns". The first element of each tuple in "list_patterns" is searched for from the beggining of the strings, the second element of each tuple is searched for from the end of each string. Subsequently, the function trims the string, appending the trimmed string to an empty list (if none of the pairs of patterns is found, it just appends the original untrimmed string).

trimmed_list = []
for el in input_list:
    for pat in list_patterns:
        beg = el.find(pat[0])
        end = el.rfind(pat[1])
        if(beg != -1 and end != -1):
            output_list.append(el[beg+len(pat[0]):end])
            break
    else:
        output_list.append(el)

The thing is, I want to trim and find the patterns, but not necessarily match only the ones that have 100% similarity. I want to also find the patterns that are somewhat similar (by a user-defined threshold) and trim the strings accordingly.

I found this function that retrieves the ratio of similarity between two strings, but I'm not able to implement it into my original function:

from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

As an example, let's say I had a string:

"ATGCATCGTACGTACGTACG"

And a tuple of patterns which may be slightly different from the original ones:

("AYTG","TARYCG")

Even thought the string does not contain exactly those patterns (but contains similar ones), I would still like to trim it (ATG | CATCGTACGTACG | TACG), and have an output:

"CATCGTACGTACG"

Is there an easy way to add the "SequenceMatcher" function to my user-defined function? Thank you so much in advance for any answers.

Upvotes: 0

Views: 111

Answers (1)

Jolbas
Jolbas

Reputation: 752

You can create a custom find function and call it instead of the string.find or string.rfind methods. This function just looks at string parts that has the same length as the pattern. If different lengths can match it has to be extended but it's hard to predict how much.

def find_similar(needle, haystack, backwards = False, min_diff_ratio = 0.75):
    n_len = len(needle)
    # create a range depending on direction
    if backwards:
        r = range(len(haystack)-n_len, -1, -1)
    else:
        r = range(len(haystack) - n_len)
    for i in r:
        # create a substring with same length as search string
        # at index
        substr = haystack[i:i + n_len]
        # Here we check for similarity using your function
        if similar(needle, substr) >= min_diff_ratio:
            return i
    return -1

Update your loop to this

trimmed_list = []
for el in input_list:
    for pat in list_patterns:
        beg = find_similar(pat[0], el)
        end = find_similar(pat[1], el, True)
        if(beg != -1 and end != -1):
            output_list.append(el[beg+len(pat[0]):end])
            break
    else:
        output_list.append(el)

Upvotes: 1

Related Questions