Reputation: 725
I have code that generates a random sequence:
import random
selection60 = {"A":20, "T":20, "G":30, "C":30}
sseq60=[]
for k in selection60:
    sseq60 = sseq60 + [k] * int(selection60[k])
random.shuffle(sseq60)
sequence="".join(random.sample(sseq60, 100))
The output in this case is:
GACCCCTCTGTACTATTAAAAGGCGTCACCGCGCCGAAAGAGCTGCAAGGCAATAGTGGACCAGAATCAAACGAAGGATTGCTTAGGTAATGGAATACAA
However, I would also like to check that no run of more than 10 identical bases is created. For example:
GACCCCCCCCCCCTATTAAAAGGCGTCATCGCGCCGAAAGAGTTGCAAGGCAATAGTGGAGCAGAATTAAACGAAGGATTGCTTAGGTAATGGAATAAAA
This sequence contains 11 Cs at the beginning, which should not be allowed. The distribution of the letters should be uniform. Does the random.sample function handle this by itself, or does it need to be implemented?
Upvotes: 0
Views: 106
Reputation: 714
The easiest to code is to check your sample and toss it if there are too many repeats:
from collections import Counter
from random import sample
pool = Counter({"A":20, "T":20, "G":30, "C":30})
too_many = [''.join([k]*11) for k in pool]
fn_select = lambda p: ''.join(sample(list(p.elements()), sum(p.values())))
selection = fn_select(pool)
while any(t in selection for t in too_many):
    selection = fn_select(pool)
print(selection)
Some detail:
too_many is set up as a list of 'illegal' sequences, i.e. ['AAAAAAAAAAA', 'TTTTTTTTTTT', 'GGGGGGGGGGG', 'CCCCCCCCCCC'].
any(t in selection for t in too_many) will be True if any of those 4 sequences are present in the selection, in which case we want to start fresh with a new sample.
Depending on your preference, you could rewrite the same code using a while True: loop:
from collections import Counter
from random import sample
pool = Counter({"A":20, "T":20, "G":30, "C":30})
too_many = [''.join([k]*11) for k in pool]
while True:
    selection = ''.join(sample(list(pool.elements()), sum(pool.values())))
    if not any(t in selection for t in too_many):
        break
print(selection)
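The substring check above can also be phrased as "measure the longest run and compare it to the limit". The following is a minimal sketch of that idea using itertools.groupby; the helper name longest_run is hypothetical and not part of the answer's code.

```python
from itertools import groupby

def longest_run(seq):
    """Length of the longest run of identical consecutive characters (0 if empty).

    Hypothetical helper, not part of the original answer.
    """
    # groupby yields one group per stretch of equal neighbouring characters
    return max((len(list(group)) for _, group in groupby(seq)), default=0)

# A short string built like the start of the rejected example in the question
bad = "GA" + "C" * 11 + "TATT"
print(longest_run(bad))        # 11
print(longest_run(bad) > 10)   # True, so this sample would be tossed
```

A sample would then be accepted only when longest_run(selection) <= 10, which should be equivalent to the too_many check for this alphabet.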
Upvotes: 1
Reputation: 1055
How about something like this:
import random
import re
concatenated = ["A"]*20 + ["T"]*20 + ["G"]*30 + ["C"]*30
while True:
    random.shuffle(concatenated)
    sequence = "".join(concatenated)
    # exit the loop since we have found a sequence not containing more than 10 repeats of any letter
    if not re.search("A{11,}|T{11,}|G{11,}|C{11,}", sequence):
        break
This will run until you find a sequence not containing more than 10 repeats in a row of any letter.
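If the alphabet or the limit might change, the per-letter alternation could be replaced by a backreference pattern. This is only a sketch: the function name has_long_run and the max_run parameter are assumptions introduced here for illustration.

```python
import re

def has_long_run(seq, max_run=10):
    """Return True if any character repeats more than max_run times in a row.

    Hypothetical helper: (.) captures one character and \\1{max_run,} requires
    at least max_run further copies of it, i.e. max_run + 1 or more in total.
    """
    pattern = r"(.)\1{%d,}" % max_run
    return re.search(pattern, seq) is not None

print(has_long_run("GA" + "C" * 11 + "TATT"))  # True  (11 Cs in a row)
print(has_long_run("GA" + "C" * 10 + "TATT"))  # False (exactly 10 is allowed)
```

The loop would then break on `not has_long_run(sequence)` instead of the hard-coded A/T/G/C pattern.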
Upvotes: 1
Reputation: 54708
Truly random sampling is sometimes going to generate long series of repeats. However, in this case, you're doing it wrong. Do the random shuffle after you generate the whole list. Do the shuffle a couple of times, if you want.
import random
selection60 = {"A":20, "T":20, "G":30, "C":30}
sseq60=[]
for k in selection60:
    sseq60 = sseq60 + [k] * int(selection60[k])
random.shuffle(sseq60)
sequence="".join(random.sample(sseq60, 100))
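Neither random.shuffle nor random.sample knows anything about run lengths, so the limit has to be checked separately. Below is a rough sketch that builds the full list, shuffles it once per attempt, and retries when a run is too long. The function name constrained_shuffle and its parameters are hypothetical. Because every rejected shuffle is simply discarded, the accepted result should still be uniformly distributed over the orderings that satisfy the limit.

```python
import random
from collections import Counter
from itertools import groupby

def constrained_shuffle(counts, max_run=10, rng=None):
    """Shuffle the letters described by counts until no run exceeds max_run.

    Hypothetical helper. It loops forever if no valid ordering exists
    (e.g. a single letter with a count above max_run).
    """
    rng = rng or random.Random()
    # Build the whole list first, then shuffle it as one unit
    pool = [base for base, n in counts.items() for _ in range(n)]
    while True:
        rng.shuffle(pool)
        longest = max((len(list(g)) for _, g in groupby(pool)), default=0)
        if longest <= max_run:
            return "".join(pool)

# Seeded generator only so that the sketch is reproducible
seq = constrained_shuffle({"A": 20, "T": 20, "G": 30, "C": 30}, rng=random.Random(0))
print(seq)
print(Counter(seq))  # letter counts match the requested composition
```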
Upvotes: 1