Reputation: 11
I have written this barbaric script to create permutations of a string of characters that contain n (up to n=4) $'s in all possible combinations of positions within the string. I will eventually .replace('$','(\\w)')
to use for mismatches in a dna search sequence. Because of the way I wrote the script, some of the permutations have less than the requested number of $'s. I then wrote a script to remove them, but it doesn't seem to be effective, and each time I run the removal script, it removes more of the unwanted permutations. In the code pasted below, you will see that I test the function with a simple sequence with 4 mismatches. I then run a series of removal scripts that count how many expressions are removed each time...in my experience, it takes about 8 times to remove all expressions with less than 4 wild-card $'s. I have a couple questions about this:
Is there a built in function for searches with 'n' mismatches? Maybe even in biopython? So far, I've seen the Paul_McGuire_regex function:
Search for string allowing for one mismatch in any location of the string,
which seems only to generate 1 mismatch. I must admit, I don't fully understand all of the code in the remainining functions on that page, as I am a very new coder.
Since I see this as a good exercise for me, is there a better way to write this entire script?...Can I iterate Paul_McGuire_regex function as many times as I need?
Most perplexing to me, why won't the removal script work 100% the first time?
Thanks for any help you can provide!
def Mismatch(Search,n):
List = []
SearchL = list(Search)
if n > 4:
return("Error: Maximum of 4 mismatches")
for i in range(0,len(Search)):
if n == 1:
SearchL_i = list(Search)
SearchL_i[i] = '$'
List.append(''.join(SearchL_i))
if n > 1:
for j in range (0,len(Search)):
if n == 2:
SearchL_j = list(Search)
SearchL_j[i] = '$'
SearchL_j[j] = '$'
List.append(''.join(SearchL_j))
if n > 2:
for k in range(0,len(Search)):
if n == 3:
SearchL_k = list(Search)
SearchL_k[i] = '$'
SearchL_k[j] = '$'
SearchL_k[k] = '$'
List.append(''.join(SearchL_k))
if n > 3:
for l in range(0,len(Search)):
if n ==4:
SearchL_l = list(Search)
SearchL_l[i] = '$'
SearchL_l[j] = '$'
SearchL_l[k] = '$'
SearchL_l[l] = '$'
List.append(''.join(SearchL_l))
counter=0
for el in List:
if el.count('$') < n:
counter+=1
List.remove(el)
return(List)
List_RE = Mismatch('abcde',4)
counter = 0
for el in List_RE:
if el.count('$') < 4:
List_RE.remove(el)
counter+=1
print("Filter2="+str(counter))
Upvotes: 1
Views: 1487
Reputation: 56915
We can do away with questions 2 and 3 by answering question 1, but understanding question 3 is important so I'll do that first and then show how you can avoid it entirely:
As to question 3, it's because when you loop over a list in python and make changes to it within the loop, the list that you loop over changes.
From the python docs on control flow (for statement section):
It is not safe to modify the sequence being iterated over in the loop (this can only happen for mutable sequence types, such as lists).
Say your list is [a,b,c,d]
and you loop through it with for el in List
.
Say el
is currently a
and you do List.remove(el)
.
Now, your list is [b,c,d]
. However, the iterator points to the second element in the list (since it's done the first), which is now c
.
In essence, you've skipped b
. So the problem is that you are modifying the list you are iterating over.
There are a few ways to fix this: if your List
is not expensive to duplicate, you could make a copy. So iterate over List[:]
but remove from List
.
But suppose it's expensive to make copies of List
all the time.
Then what you do is iterate over it backwards. Note the reversed
below:
for el in reversed(List):
if el.count('$') < n:
counter+=1
List.remove(el)
return(List)
In the example above, suppose we iterate backwards over List
.
The iterator starts at d
, and then goes to c
.
Suppose we remove c
, so that List=[a,b,d]
.
Since the iterator is going backwards, it now points to element b
, so we haven't skipped anything.
Basically, this avoids modifying bits of the list you have yet to iterate over.
If I understand your question correctly, you basically want to choose n
out of m
positions, where m
is the length of the string (abcde
), and place a '$' in each of these n
positions.
In that case, you can use the itertools
module to do that.
import itertools
def Mismatch(Search,n):
SearchL = list(Search)
List = [] # hold output
# print list of indices to replace with '$'
idxs = itertools.combinations(range(len(SearchL)),n)
# for each combination `idx` in idxs, replace str[idx] with '$':
for idx in idxs:
str = SearchL[:] # make a copy
for i in idx:
str[i]='$'
List.append( ''.join(str) ) # convert back to string
return List
Let's look at how this works:
Search
string into a list so it can be iterated over, create empty List
to hold results.idxs = itertools.combinations(range(len(SearchL)),n)
says "find all subsets of length n in the set [0,1,2,3,...,length-of-search-string -1]
.
Try
idxs = itertools.combinations(range(5),4)
for idx in idxs:
print idx
to see what I mean.
idxs
is a tuple of n
indices from 0 to len(SearchL)-1
(e.g. (0,1,2,4)
. Replace the i'th character of SearchL
with a '$' for each i
in the tuple.List
.As an example:
Mismatch('abcde',3)
['$$$de', '$$c$e', '$$cd$', '$b$$e', '$b$d$', '$bc$$', 'a$$$e', 'a$$d$', 'a$c$$', 'ab$$$']
Mismatch('abcde',4) # note, the code you had made lots of duplicates.
['$$$$e', '$$$d$', '$$c$$', '$b$$$', 'a$$$$']
Upvotes: 3