jhjudd
jhjudd

Reputation: 11

Multiple mismatches in DNA search sequence regex

I have written this barbaric script to create permutations of a string of characters that contain n (up to n=4) $'s in all possible combinations of positions within the string. I will eventually .replace('$','(\\w)') to use for mismatches in a dna search sequence. Because of the way I wrote the script, some of the permutations have less than the requested number of $'s. I then wrote a script to remove them, but it doesn't seem to be effective, and each time I run the removal script, it removes more of the unwanted permutations. In the code pasted below, you will see that I test the function with a simple sequence with 4 mismatches. I then run a series of removal scripts that count how many expressions are removed each time...in my experience, it takes about 8 times to remove all expressions with less than 4 wild-card $'s. I have a couple questions about this:

  1. Is there a built in function for searches with 'n' mismatches? Maybe even in biopython? So far, I've seen the Paul_McGuire_regex function:
    Search for string allowing for one mismatch in any location of the string,
    which seems only to generate 1 mismatch. I must admit, I don't fully understand all of the code in the remainining functions on that page, as I am a very new coder.

  2. Since I see this as a good exercise for me, is there a better way to write this entire script?...Can I iterate Paul_McGuire_regex function as many times as I need?

  3. Most perplexing to me, why won't the removal script work 100% the first time?

Thanks for any help you can provide!

def Mismatch(Search,n):
    List = []
    SearchL = list(Search)
    if n > 4:
        return("Error: Maximum of 4 mismatches")
    for i in range(0,len(Search)):
        if n == 1:
            SearchL_i = list(Search)
            SearchL_i[i] = '$'
            List.append(''.join(SearchL_i))
        if n > 1:
            for j in range (0,len(Search)):
                if n == 2:
                    SearchL_j = list(Search)
                    SearchL_j[i] = '$'
                    SearchL_j[j] = '$'
                    List.append(''.join(SearchL_j))
                if n > 2:
                    for k in range(0,len(Search)):
                        if n == 3:
                            SearchL_k = list(Search)
                            SearchL_k[i] = '$'
                            SearchL_k[j] = '$'
                            SearchL_k[k] = '$'
                            List.append(''.join(SearchL_k))
                        if n > 3:
                            for l in range(0,len(Search)):
                                if n ==4:
                                    SearchL_l = list(Search)
                                    SearchL_l[i] = '$'
                                    SearchL_l[j] = '$'
                                    SearchL_l[k] = '$'
                                    SearchL_l[l] = '$'
                                    List.append(''.join(SearchL_l))
    counter=0
    for el in List:
        if el.count('$') < n:
            counter+=1
            List.remove(el)
    return(List) 

List_RE = Mismatch('abcde',4)

counter = 0
for el in List_RE:
    if el.count('$') < 4:
        List_RE.remove(el)
        counter+=1

print("Filter2="+str(counter))

Upvotes: 1

Views: 1487

Answers (1)

mathematical.coffee
mathematical.coffee

Reputation: 56915

We can do away with questions 2 and 3 by answering question 1, but understanding question 3 is important so I'll do that first and then show how you can avoid it entirely:

Question 3

As to question 3, it's because when you loop over a list in python and make changes to it within the loop, the list that you loop over changes.

From the python docs on control flow (for statement section):

It is not safe to modify the sequence being iterated over in the loop (this can only happen for mutable sequence types, such as lists).

Say your list is [a,b,c,d] and you loop through it with for el in List. Say el is currently a and you do List.remove(el).

Now, your list is [b,c,d]. However, the iterator points to the second element in the list (since it's done the first), which is now c. In essence, you've skipped b. So the problem is that you are modifying the list you are iterating over.

There are a few ways to fix this: if your List is not expensive to duplicate, you could make a copy. So iterate over List[:] but remove from List.

But suppose it's expensive to make copies of List all the time. Then what you do is iterate over it backwards. Note the reversed below:

for el in reversed(List):
    if el.count('$') < n:
        counter+=1
        List.remove(el)
return(List) 

In the example above, suppose we iterate backwards over List. The iterator starts at d, and then goes to c. Suppose we remove c, so that List=[a,b,d]. Since the iterator is going backwards, it now points to element b, so we haven't skipped anything.

Basically, this avoids modifying bits of the list you have yet to iterate over.

Questions 1 & 2

If I understand your question correctly, you basically want to choose n out of m positions, where m is the length of the string (abcde), and place a '$' in each of these n positions.

In that case, you can use the itertools module to do that.

import itertools
def Mismatch(Search,n):
    SearchL = list(Search)
    List     = [] # hold output
    # print list of indices to replace with '$'
    idxs = itertools.combinations(range(len(SearchL)),n)        
    # for each combination `idx` in idxs, replace str[idx] with '$':
    for idx in idxs:
        str = SearchL[:] # make a copy
        for i in idx:
            str[i]='$'
        List.append( ''.join(str) ) # convert back to string
    return List

Let's look at how this works:

  1. turn the Search string into a list so it can be iterated over, create empty List to hold results.
  2. idxs = itertools.combinations(range(len(SearchL)),n) says "find all subsets of length n in the set [0,1,2,3,...,length-of-search-string -1]. Try

    idxs = itertools.combinations(range(5),4)
    for idx in idxs: 
        print idx
    

    to see what I mean.

  3. Each element of idxs is a tuple of n indices from 0 to len(SearchL)-1 (e.g. (0,1,2,4). Replace the i'th character of SearchL with a '$' for each i in the tuple.
  4. Convert the result back into a string and add it to List.

As an example:

Mismatch('abcde',3)
['$$$de', '$$c$e', '$$cd$', '$b$$e', '$b$d$', '$bc$$', 'a$$$e', 'a$$d$', 'a$c$$', 'ab$$$']
Mismatch('abcde',4) # note, the code you had made lots of duplicates.
['$$$$e', '$$$d$', '$$c$$', '$b$$$', 'a$$$$'] 

Upvotes: 3

Related Questions