Reputation: 504
I have these functions to simulate mutations on DNA sequences (strings of letters such as 'ACGTGCTTAGG').
The first one just changes a random position of the input sequence:
import random

def mutate(sequence):
    seq_lst = list(sequence)
    i = random.randint(0, len(seq_lst) - 1)
    seq_lst[i] = random.choice(list('ATCG'))
    return ''.join(seq_lst)
The second one simulates the insertion of a base at a random position in the sequence:
def insertion(sequence):
    seq_lst = list(sequence)
    i = random.randint(0, len(seq_lst) - 1)
    # Local name chosen so it does not shadow the mutate() function above
    mutated = seq_lst[:i] + [random.choice(list('ATCG'))] + seq_lst[i:]
    return ''.join(mutated)
The last one picks one of the possible mutation types at random and applies it to the sequence:
def mutations(sequence):
    i = random.randint(0, 3)
    print(i)
    if i == 0:
        print('SNV')
        return mutate(sequence)
    elif i == 1:
        print('Del')
        return sequence.replace(random.choice('ATCG'), '-')
    elif i == 2:
        print('Ins')
        return insertion(sequence)
    elif i == 3:
        print('No mut')
        return sequence
The print statements are just there to check that the code is working as expected.
Any suggestions for improvement? If possible, I would also like suggestions on how to add mutation probabilities to the code to simulate a more realistic situation.
What I saw after 10000 random mutation rounds is that the sequence accumulates a lot of deletions, which is wrong, since single-point mutations should be the most frequent, followed by insertions and deletions at a lower frequency.
Thanks
Upvotes: 2
Views: 74
Reputation: 636
The simulation produces an excessive number of deletions because of this line:
import random
sequence = 'ACTCAG'
print(sequence.replace(random.choice('ATCG'), '-'))
Output (when random.choice happens to pick 'C'):
A-T-AG
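str.replace replaces every occurrence of the chosen base, not a single position. In the example above, drawing A or C (half of the draws) deletes two bases in one event, and in a long sequence with roughly equal base frequencies a single "deletion" event wipes out about a quarter of the bases. An insertion or SNV, by contrast, always changes exactly one position. So although each of the four branches is selected with probability 1/4, the outcomes are not uniform: a deletion event changes far more positions than an insertion or a substitution.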
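As a minimal sketch of a deletion that affects only one position, assuming you want the base actually removed rather than replaced by '-', something like the following could be used. The name delete_one is hypothetical and not part of the original code.

```python
import random

def delete_one(sequence):
    """Remove exactly one base at a random position (hypothetical helper)."""
    if not sequence:
        # Nothing to delete from an empty sequence
        return sequence
    i = random.randrange(len(sequence))
    # Keep everything before and after position i
    return sequence[:i] + sequence[i + 1:]
```

With this, one deletion event shortens the sequence by exactly one base, which puts it on the same footing as one insertion or one SNV.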
There are two other biases. First, mutate() can draw the base that is already present at the chosen position (for example A->A), so about 1/4 of SNV events will not appear to mutate anything. This also occurs naturally in DNA mutations, but it is worth keeping in mind.
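If you would rather have every SNV event produce a visible change, one option is to exclude the current base before drawing the replacement. This is only a sketch; substitute_one and BASES are hypothetical names introduced for illustration.

```python
import random

BASES = 'ACGT'

def substitute_one(sequence):
    """Change one random position to a different base (hypothetical helper)."""
    if not sequence:
        return sequence
    i = random.randrange(len(sequence))
    # Exclude the current base so the event always changes the sequence
    alternatives = [b for b in BASES if b != sequence[i]]
    return sequence[:i] + random.choice(alternatives) + sequence[i + 1:]
```

Whether to allow silent draws is a modelling decision: keeping them lowers the effective SNV rate by about a quarter, while excluding them makes the event count equal to the number of observed changes.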
Second, once a deletion occurs, the system is biased towards fewer and fewer nucleotides, so the ones that remain are subject to an increasing share of the mutation events.
In other words, the probabilities are not uniform and will change during the simulation.
You could instead use re.sub, via the function below, to keep the probabilities of insertion and deletion uniform:
import random, re

def rand_replacement(string, to_be_replaced, items):
    # Replace each match of to_be_replaced with an item drawn at random from items
    return re.sub(to_be_replaced, lambda x: random.choice(items), string)
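On the question of adding mutation probabilities, one possible approach is random.choices with a weights argument (available since Python 3.6). The sketch below is self-contained and makes several assumptions: the helper names and the WEIGHTS values are hypothetical, the weights are placeholders rather than measured rates, and deletions remove a single base instead of writing '-'.

```python
import random

BASES = 'ACGT'

def substitute(seq):
    """SNV: change one random position (may redraw the same base)."""
    if not seq:
        return seq
    i = random.randrange(len(seq))
    return seq[:i] + random.choice(BASES) + seq[i + 1:]

def insert(seq):
    """Insert one random base; randint(0, len) also allows the end."""
    i = random.randint(0, len(seq))
    return seq[:i] + random.choice(BASES) + seq[i:]

def delete(seq):
    """Remove exactly one base at a random position."""
    if not seq:
        return seq
    i = random.randrange(len(seq))
    return seq[:i] + seq[i + 1:]

def no_change(seq):
    return seq

EVENTS = [substitute, insert, delete, no_change]
# Placeholder weights (assumption): SNVs most frequent, indels rarer
WEIGHTS = [0.70, 0.10, 0.10, 0.10]

def mutate_weighted(sequence):
    """Pick one event type according to WEIGHTS and apply it once."""
    event = random.choices(EVENTS, weights=WEIGHTS, k=1)[0]
    return event(sequence)
```

Calling mutate_weighted('ACGTGCTTAGG') repeatedly applies one event per call, each changing at most one position, so the relative frequencies of the event types follow WEIGHTS. The weights need not sum to 1, since random.choices normalises them.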
Upvotes: 2