Random DNA mutation Generator

Question

I'd like to create a dictionary of dictionaries for a series of mutated DNA strands, with each dictionary demonstrating the original base as well as the base it has mutated to.

To elaborate, what I would like to do is create a generator that allows one to input a specific DNA strand and have it crank out 100 randomly generated strands that have a mutation frequency of 0.66% (this applies to each base, and each base can mutate to any other base). Then, what I would like to do is create a series of dictionary, where each dictionary details the mutations that occured in a specific randomly generated strand. I'd like the keys to be the original base, and the values to be the new mutated base. Is there a straightforward way of doing this? So far, I've been experimenting with a loop that looks like this:

#yields a strand with an A-T mutation frequency of 0.066%
def mutate(string, mutation, threshold):
    dna = list(string)
    for index, char in enumerate(dna):
        if char in mutation:
            if random.random() < threshold:
                dna[index] = mutation[char]

    return ''.join(dna)

dna = "ATGTCGTACGTTTGACGTAGAG"
print("DNA first:", dna)
newDNA = mutate(dna, {"A": "T"}, 0.0066)
print("DNA now:", newDNA)

But I can only yield one strand with this code, and it only focuses on T-->A mutations. I'm also not sure how to tie the dictionary into this. could someone show me a better way of doing this? Thanks.

Blckknght · Accepted Answer

It sounds like there are two parts to your issue. The first is that you want to mutate your DNA sequence several times, and the second is that you want to gather some additional information about the mutations in a data structure of some kind. I'll handle each of those separately.

Producing 100 random results from the same source string is pretty easy. You can do it with an explicit loop (for instance, in a generator function), but you can just as easily use a list comprehension to run a single-mutation function over and over:

results = [mutate(original_string) for _ in range(100)]

Of course, if you make the mutate function more complicated, this simple code may not be appropriate. If it returns some kind of more sophisticated data structure, rather than just a string, you may need to do some additional processing to combine the data in the format you want.

As for how to build those data structures, I think the code you have already is a good start. You'll need to decide how exactly you're going to be accessing your data, and then let that guide you to the right kind of container.

For instance, if you just want to have a simple record of all the mutations that happen to a string, I'd suggest a basic list that contains tuples of the base before and after the mutation. On the other hand, if you want to be able to efficiently look up what a given base mutates to, a dictionary with lists as values might be more appropriate. You could also include the index of the mutated base if you wanted to.

Here's a quick attempt at a function that returns the mutated string along with a list of tuples recording all the mutations:

bases = "ACGT"

def mutate(orig_string, mutation_rate=0.0066):
    result = []
    mutations = []
    for base in orig_string:
        if random.random() < mutation_rate:
            new_base = bases[bases.index(base) - random.randint(1, 3)] # negatives are OK
            result.append(new_base)
            mutations.append((base, new_base))
        else:
            result.append(base)
    return "".join(result), mutations

The most tricky bit of this code is how I'm picking the replacement of the current base. The expression bases[bases.index(base) - random.randint(1, 3)] does it all in one go. Lets break down the different bits. bases.index(base) gives the index of the previous base in the global bases string at the top of the code. Then I subtract a random offset from this index (random.randint(1, 3)). The new index may be negative, but that's OK, as when we use it to index back into the bases string (bases[...]), negative indexes count from the right, rather than the left.

Here's how you could use it:

string = "ATGT"
results = [mutate(string) for _ in range(100)]
for result_string, mutations in results:
    if mutations: # skip writing out unmutated strings
        print(result_string, mutations)

For short strings, like "ATGT" you're very unlikely to get more than one mutation, and even one is pretty rare. The loop above tends to print between 2 and 4 results on each run (that is, more than 95% of length-four strings are not mutated at all). Longer strings will have mutations more often, and it's more plausible that you'll see multiple mutations in one string.

Random DNA mutation Generator

Answers (1)

Related Questions