H.F.S C.
H.F.S C.

Reputation: 109

define a loop which returns different possibilities

Hello I am pretty new to python. I have the following problem: I want to write a script that, given a (dna) sequence with ambiguities, writes all possible sequences, (if there are less than 100, if there are more than 100 possible sequences, an appropriate error message is printed) For DNA nucleotide ambiguities: http://www.bioinformatics.org/sms/iupac.html

Example: for the sequence “AYGH” the script’s output would be “ACGA”, “ACGC”, “ACGT”, “ATGA”, “ATGC”, and “ATGT”. A, C, G and T are the default nucleotides. ALL others can have different values (see link).

So i wrote this:

def possible_sequences (seq):
    poss_seq = ''
    for i in seq:
        if i=='A'or i=='C'or i=='G'or i=='T': 
            poss_seq += i 
        else: 
            if i== 'R':  
                poss_seq += 'A' # OR 'G', how should i implement this? 
            elif i == 'Y': 
                poss_seq += 'C' # OR T 
            elif i == 'S': 
                poss_seq += 'G' # OR C
            elif i == 'W': 
                poss_seq += 'A' # OR T 
            elif i == 'K': 
                poss_seq += 'G' # OR T
            elif i == 'M': 
                poss_seq += 'A' # OR C
            elif i == 'B': 
                poss_seq += 'C' # OR G OR T 
            elif i == 'D': 
                poss_seq += 'A' # OR G OR T 
            elif i == 'H': 
                poss_seq += 'A' # OR C OR T 
            elif i == 'V': 
                poss_seq += 'A' # OR C OR G 
            elif i == 'N': 
                poss_seq += 'A' # OR C OR G OR T 
            elif i == '-' or i == '.': 
                poss_seq += ' '
    return poss_seq

when I test my function: possible_sequences ('ATRY-C') i got:

'ATAC C'

but i should have get:

'ATAC C'
'ATAT C' 
'ATGC C'
'ATGT C'

Can somebody please help me? I understand that I have to recap the and write a second poss_seq when there is an ambiguity present but I don't know how...

Upvotes: 0

Views: 150

Answers (1)

niemmi
niemmi

Reputation: 17263

You can use itertools.product to generate the possibilities:

from itertools import product

# List possible nucleotides for each possible item in sequence
MAP = {
    'A': 'A',
    'C': 'C',
    'G': 'G',
    'T': 'T',
    'R': 'AG',
    'Y': 'CT',
    'S': 'GC',
    'W': 'AT',
    'K': 'GT',
    'M': 'AC',
    'B': 'CGT',
    'D': 'AGT',
    'H': 'ACT',
    'V': 'ACG',
    'N': 'ACGT',
    '-': ' ',
    '.': ' '
}

def possible_sequences(seq):
    return (''.join(c) for c in product(*(MAP[c] for c in seq)))

print(list(possible_sequences('AYGH')))
print(list(possible_sequences('ATRY-C')))

Output:

['ACGA', 'ACGC', 'ACGT', 'ATGA', 'ATGC', 'ATGT']
['ATAC C', 'ATAT C', 'ATGC C', 'ATGT C']

In above we first iterate over the items in the given sequence and get the list of possible nucleotides for each item:

possibilities = [MAP[c] for c in 'ATRY-C']
print(possibilities)

# ['A', 'T', 'AG', 'CT', ' ', 'C']

Then the iterable is unpacked as arguments given to product which will return the cartesian product:

products = list(product(*['A', 'T', 'AG', 'CT', ' ', 'C']))
print(products)

# [('A', 'T', 'A', 'C', ' ', 'C'), ('A', 'T', 'A', 'T', ' ', 'C'), 
#  ('A', 'T', 'G', 'C', ' ', 'C'), ('A', 'T', 'G', 'T', ' ', 'C')]

Finally each one of the products is turned to a string with join:

print(list(''.join(p) for p in products))

# ['ATAC C', 'ATAT C', 'ATGC C', 'ATGT C']

Note that possible_sequences returns a generator instead of constructing all the possible sequences at once so you can easily stop the iteration whenever you want instead of having to wait every sequence to be generated.

Upvotes: 7

Related Questions