Paolo Lorenzini

Reputation: 725

Eliminate repeats in random module in Python

I have code that generates a random sequence:

import random
selection60 = {"A":20, "T":20, "G":30, "C":30}
sseq60=[]
for k in selection60:
    sseq60 = sseq60 + [k] * int(selection60[k])
    random.shuffle(sseq60)
sequence="".join(random.sample(sseq60, 100))

The output in this case is:

GACCCCTCTGTACTATTAAAAGGCGTCACCGCGCCGAAAGAGCTGCAAGGCAATAGTGGACCAGAATCAAACGAAGGATTGCTTAGGTAATGGAATACAA

However, I would like to add a check that no repeat longer than 10 bases is created. For example:

GACCCCCCCCCCCTATTAAAAGGCGTCATCGCGCCGAAAGAGTTGCAAGGCAATAGTGGAGCAGAATTAAACGAAGGATTGCTTAGGTAATGGAATAAAA

This sequence contains 11 Cs at the beginning, which should not be allowed. The distribution of the letters should be uniform. Does the random.sample function take care of this by itself, or does it need to be implemented?

Upvotes: 0

Views: 106

Answers (3)

Jamie Deith

Reputation: 714

The easiest approach to code is to check your sample and toss it if it contains too many repeats:

from collections import Counter
from random import sample

pool = Counter({"A":20, "T":20, "G":30, "C":30})
too_many = [''.join([k]*11) for k in pool]
fn_select = lambda p: ''.join(sample(list(p.elements()), sum(p.values())))
selection = fn_select(pool)
while any(t in selection for t in too_many):
    selection = fn_select(pool)
print(selection)

Some detail:

too_many is set up as a list of 'illegal' sequences, i.e. ['AAAAAAAAAAA', 'TTTTTTTTTTT', 'GGGGGGGGGGG', 'CCCCCCCCCCC'].

any(t in selection for t in too_many) will be True if any of those 4 sequences are present in the selection, in which case we want to start fresh with a new sample.

Depending on your preference, you could rewrite the same code using a while True: loop:

from collections import Counter
from random import sample
pool = Counter({"A":20, "T":20, "G":30, "C":30})
too_many = [''.join([k]*11) for k in pool]
while True:
    selection = ''.join(sample(list(pool.elements()), sum(pool.values())))
    if not any(t in selection for t in too_many):
        break
print(selection)
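
If you want the run limit to be a parameter and a way to verify the result, the same rejection idea can be wrapped in a function. This is only a minimal sketch: the names longest_run and sample_without_long_runs are hypothetical helpers, not part of the code above, and the run length is measured with itertools.groupby instead of the too_many list.

```python
from collections import Counter
from itertools import groupby
from random import sample


def longest_run(seq):
    """Return the length of the longest run of identical characters (hypothetical helper)."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


def sample_without_long_runs(pool, max_run=10):
    """Resample the whole pool until no run is longer than max_run (hypothetical helper)."""
    letters = list(pool.elements())
    while True:
        # sample() with k == len(letters) returns a random permutation of the pool
        selection = ''.join(sample(letters, len(letters)))
        if longest_run(selection) <= max_run:
            return selection


pool = Counter({"A": 20, "T": 20, "G": 30, "C": 30})
selection = sample_without_long_runs(pool)
print(selection)
```

Note that the loop only terminates if some arrangement of the pool satisfies max_run, so a very small limit could make it run forever.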

Upvotes: 1

charles

Reputation: 1055

How about something like this:

import random
import re

concatenated = ["A"]*20 + ["T"]*20 + ["G"]*30 + ["C"]*30
while True:
    random.shuffle(concatenated)
    sequence = "".join(concatenated)
    # exit the loop since we have found a sequence not containing more than 10 repeats of any letter
    if not re.search("A{11,}|T{11,}|G{11,}|C{11,}", sequence):
        break

This will run until you find a sequence not containing more than 10 repeats in a row of any letter.
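
If the alphabet or the limit might change, the pattern does not have to list each letter. A possible variant, sketched here under the assumption that any repeated character counts as a repeat, uses a back-reference; has_long_run and max_run are hypothetical names introduced for illustration.

```python
import random
import re


def has_long_run(sequence, max_run=10):
    """Return True if any character occurs more than max_run times in a row (hypothetical helper)."""
    # (.) captures one character; \1{max_run,} requires at least max_run further copies of it,
    # so the whole match is a run of max_run + 1 or more identical characters.
    pattern = r"(.)\1{%d,}" % max_run
    return re.search(pattern, sequence) is not None


concatenated = ["A"] * 20 + ["T"] * 20 + ["G"] * 30 + ["C"] * 30
while True:
    random.shuffle(concatenated)
    sequence = "".join(concatenated)
    # exit the loop once no letter repeats more than 10 times in a row
    if not has_long_run(sequence):
        break
print(sequence)
```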

Upvotes: 1

Tim Roberts

Reputation: 54708

Truly random sampling is sometimes going to generate long series of repeats. However, in this case, your code calls random.shuffle inside the loop, on a partially built list. Do the random shuffle once, after you generate the whole list. Do the shuffle a couple of times, if you want.

import random
selection60 = {"A":20, "T":20, "G":30, "C":30}
sseq60=[]
for k in selection60:
    sseq60 = sseq60 + [k] * int(selection60[k])
random.shuffle(sseq60)
sequence="".join(random.sample(sseq60, 100))
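
To get a feel for how often a single shuffle actually produces a run longer than 10, you could estimate it by simulation. This is a rough sketch, not a proof: shuffled_sequence, longest_run and fraction_with_long_runs are hypothetical helper names, and the trial count and seed are arbitrary assumptions. For this composition such runs should be rare, but the estimate depends on the number of trials.

```python
import random
from itertools import groupby

selection60 = {"A": 20, "T": 20, "G": 30, "C": 30}


def shuffled_sequence(counts, rng):
    """Build the full list of letters first, then shuffle it once (hypothetical helper)."""
    letters = [base for base, n in counts.items() for _ in range(n)]
    rng.shuffle(letters)
    return "".join(letters)


def longest_run(seq):
    """Return the length of the longest run of identical characters (hypothetical helper)."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


def fraction_with_long_runs(counts, trials=2000, max_run=10, seed=0):
    """Estimate the fraction of shuffles containing a run longer than max_run."""
    rng = random.Random(seed)  # seeded only so that the estimate is reproducible
    hits = sum(
        longest_run(shuffled_sequence(counts, rng)) > max_run
        for _ in range(trials)
    )
    return hits / trials


print(fraction_with_long_runs(selection60))
```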

Upvotes: 1
