arjan-hada

Reputation: 269

Compute score of a list of motifs

I have implemented a function that does this. For every column in the picture, it takes the count of the most common element and subtracts it from the total number of elements in that column. It then sums these numbers over all columns. This image shows what the function does.

Is there any way to make it faster? Here is my code:

def scoreMotifs(motifs):
    '''This function computes the score of list of motifs'''
    z = []
    for i in range(len(motifs[0])):
        y = ''
        for j in range(len(motifs)):
            y += motifs[j][i]
        z.append(y)
    print z
    totalscore = 0
    for string in z:
        score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
        totalscore += score
    return totalscore

motifs = ['GCG','AAG','AAG','ACG','CAA']
scoreMotifs(motifs)
['GAAAC', 'CAACA', 'GGGGA']
5

Upvotes: 0

Views: 1593

Answers (1)

Dair

Reputation: 16240

Ok, so I used line_profiler to analyze your code:

from random import randrange

@profile
def scoreMotifs(motifs):
    '''This function computes the score of list of motifs'''
    z = []
    for i in range(len(motifs[0])):
        y = ''
        for j in range(len(motifs)):
            y += motifs[j][i]
        z.append(y)
    totalscore = 0
    for string in z:
        score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
        totalscore += score
    return totalscore   

def random_seq():
    dna_mapping = ['T', 'A', 'C', 'G']
    return ''.join([dna_mapping[randrange(4)] for _ in range(3)])

motifs = [random_seq() for _ in range(1000000)]
print scoreMotifs(motifs)

These were the results:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
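     3                                           @profile
     4                                           def scoreMotifs(motifs):
     5                                               '''This function computes the score of list of motifs'''
     6         1            4      4.0      0.0      z = []
     7         4           14      3.5      0.0      for i in range(len(motifs[0])):
     8         3            2      0.7      0.0          y = ''
     9   3000003      1502627      0.5     41.7          for j in range(len(motifs)):
    10   3000000      2075204      0.7     57.5              y += motifs[j][i]
    11         3           22      7.3      0.0          z.append(y)
    12         1            1      1.0      0.0      totalscore = 0
    13         4            4      1.0      0.0      for string in z:
    14         3        29489   9829.7      0.8          score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
    15         3            5      1.7      0.0          totalscore += score
    16         1            1      1.0      0.0      return totalscore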
Total Time: 3.60737 s

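Almost all of the running time (41.7% + 57.5%, about 99%) goes to the inner loop that builds each column one character at a time, that is, the for j line and this statement: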

y += motifs[j][i]

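There is a much better way of transposing your strings, though: the zip trick. zip(*motifs) unpacks the list so that zip receives every string as a separate argument and yields the columns as tuples of characters. Tuples have a count method too, so the scoring loop does not need to change. You can therefore rewrite your code as: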

from random import randrange

@profile
def scoreMotifs(motifs):
    '''This function computes the score of list of motifs'''
    z = zip(*motifs)
    totalscore = 0
    for string in z:
        score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
        totalscore += score
    return totalscore  

def random_seq():
    dna_mapping = ['T', 'A', 'C', 'G']
    return ''.join([dna_mapping[randrange(4)] for _ in range(3)])


motifs = [random_seq() for _ in range(1000000)]
print scoreMotifs(motifs)

motifs = ['GCG','AAG','AAG','ACG','CAA']
print scoreMotifs(motifs)
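To see what the zip trick actually produces, here is a minimal sketch on the small example from the question. The names columns and column_strings are only for illustration, and print is written as a function call so the snippet runs on both Python 2 and Python 3.

```python
motifs = ['GCG', 'AAG', 'AAG', 'ACG', 'CAA']

# zip(*motifs) is the same as zip('GCG', 'AAG', 'AAG', 'ACG', 'CAA'):
# it groups the i-th character of every string, i.e. it yields the columns.
columns = list(zip(*motifs))

# Each column is a tuple of characters; join them to compare with the
# strings that the original nested loop builds.
column_strings = [''.join(col) for col in columns]
print(column_strings)  # ['GAAAC', 'CAACA', 'GGGGA']
```

One difference to keep in mind: zip stops at the shortest string, whereas the original loop uses the length of motifs[0] for every string, so the two only agree when all motifs have the same length.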

The total time:

Total time: 0.61699 s

I'd say that is a pretty nice improvement: roughly 5.8 times faster under the profiler (3.60737 s down to 0.61699 s).
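If you want something more compact, the same score can also be sketched with collections.Counter on top of the zip trick. This is only an illustration: score_motifs_counter is a name I made up, I have not profiled it, and unlike the code above it counts whatever symbols appear in a column rather than only A, C, G and T.

```python
from collections import Counter


def score_motifs_counter(motifs):
    """Sum over columns of (column length - count of the most common symbol)."""
    total = 0
    for column in zip(*motifs):
        # most_common(1) returns a list with one (symbol, count) pair.
        most_common_count = Counter(column).most_common(1)[0][1]
        total += len(column) - most_common_count
    return total


# Columns are GAAAC, CAACA and GGGGA, so the score is 2 + 2 + 1.
print(score_motifs_counter(['GCG', 'AAG', 'AAG', 'ACG', 'CAA']))  # 5
```

Whether this is faster than four str.count calls per column is something you would have to measure on your own data.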

Upvotes: 2
