Abhi
Abhi

Reputation: 6255

introducing mutations in a DNA string in python

Given a DNA string for example AGC. I am trying to generate all possible uniq strings allowing upto #n (user defined number) mismatches in the given string.

I am able to do this for one mismatch in the following way but not able to implement the recursive solution to generate all the possible combinations based on #n mismatch, DNA string and mutation set(AGCTN)

temp_dict = {}
sequence = 'AGC'

    for x in xrange(len(sequence)):
        prefix = sequence[:x]
        suffix = sequence[x+1:]
        temp_dict.update([ (prefix+base+suffix,1) for base in 'ACGTN'])
    print temp_dict

An example:

for a given sample string : ACG, the following are the 13 uniq sequences allowing upto one mismatch

{'ACC': 1, 'ATG': 1, 'AAG': 1, 'ANG': 1, 'ACG': 1, 'GCG': 1, 'AGG': 1, 
'ACA': 1, 'ACN': 1, 'ACT': 1, 'TCG': 1, 'CCG': 1, 'NCG': 1}

I want to generalize this so that the program can take a 100 characters long DNA string and return a list/dict of uniq strings allowing user defined #mismatches

Thanks! -Abhi

Upvotes: 3

Views: 2660

Answers (2)

user2976243
user2976243

Reputation: 31

I believe the accepted answer only gives N mismatches, not up to N. A slight modification to the accepted answer should correct this I think:

from itertools import combinations,product

def mismatch(word, i = 2):
  for d in range(i+1):
    for locs in combinations(range(len(word)), d):
      thisWord = [[char] for char in word]
      for loc in locs:
        origChar = word[loc]
        thisWord[loc] = [l for l in "ACGT" if l != origChar]
      for poss in product(*thisWord):
        yield "".join(poss)    

kMerList = list(mismatch("AAAA",3))

print kMerList

I am completely new to programming, so please correct me if I'm wrong.

Upvotes: 3

DSM
DSM

Reputation: 353059

Assuming I understand you, I think you can use the itertools module. The basic idea is to choose locations where there's going to be a mismatch using combinations and then construct all satisfying lists using product:

import itertools

def mismatch(word, letters, num_mismatches):
    for locs in itertools.combinations(range(len(word)), num_mismatches):
        this_word = [[char] for char in word]
        for loc in locs:
            orig_char = word[loc]
            this_word[loc] = [l for l in letters if l != orig_char]
        for poss in itertools.product(*this_word):
            yield ''.join(poss)

For your example case:

>>> mismatch("ACG", "ACGTN", 0)
<generator object mismatch at 0x1004bfaa0>
>>> list(mismatch("ACG", "ACGTN", 0))
['ACG']
>>> list(mismatch("ACG", "ACGTN", 1))
['CCG', 'GCG', 'TCG', 'NCG', 'AAG', 'AGG', 'ATG', 'ANG', 'ACA', 'ACC', 'ACT', 'ACN']

Upvotes: 6

Related Questions