Reputation: 269
I have implemented a function that does this. For every column in the picture, it takes the count of the most common element and subtracts it from the total number of elements in that column. It then sums these differences over all columns. This image shows what the function does.
Is there any way to make it faster? Here is my code:
def scoreMotifs(motifs):
    '''This function computes the score of list of motifs'''
    # Transpose the motifs: z[i] holds column i as a string.
    z = []
    for i in range(len(motifs[0])):
        y = ''
        for j in range(len(motifs)):
            y += motifs[j][i]
        z.append(y)
    print(z)
    # For each column, count the symbols that differ from the most common one.
    totalscore = 0
    for string in z:
        score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
        totalscore += score
    return totalscore
motifs = ['GCG','AAG','AAG','ACG','CAA']
scoreMotifs(motifs)
['GAAAC', 'CAACA', 'GGGGA']
5
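To make the arithmetic behind that 5 concrete, here is a minimal sketch that prints the per-column breakdown for the same example. The name `column_scores` is only for illustration and is not part of the function above.

```python
# Column-by-column breakdown of the score for the example motifs.
motifs = ['GCG', 'AAG', 'AAG', 'ACG', 'CAA']

column_scores = []
for i in range(len(motifs[0])):
    # Build column i by taking character i of every motif.
    column = ''.join(m[i] for m in motifs)
    # Count of the most common nucleotide in this column.
    most_common = max(column.count(n) for n in 'ACGT')
    column_scores.append(len(column) - most_common)
    print('%s: %d - %d = %d' % (column, len(column), most_common,
                                len(column) - most_common))

# The total score is the sum of the per-column scores.
print(sum(column_scores))
```

The columns GAAAC, CAACA and GGGGA contribute 2, 2 and 1, which add up to 5.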
Upvotes: 0
Views: 1593
Reputation: 16240
Ok, so I used line_profiler to analyze your code:
from random import randrange

@profile  # injected by kernprof (line_profiler) when the script is run with "kernprof -l"
def scoreMotifs(motifs):
    '''This function computes the score of list of motifs'''
    z = []
    for i in range(len(motifs[0])):
        y = ''
        for j in range(len(motifs)):
            y += motifs[j][i]
        z.append(y)
    totalscore = 0
    for string in z:
        score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
        totalscore += score
    return totalscore

def random_seq():
    dna_mapping = ['T', 'A', 'C', 'G']
    return ''.join([dna_mapping[randrange(4)] for _ in range(3)])

motifs = [random_seq() for _ in range(1000000)]
print(scoreMotifs(motifs))
These were the results:
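Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def scoreMotifs(motifs):
     5                                               '''This function computes the score of list of motifs'''
     6         1            4      4.0      0.0      z = []
     7         4           14      3.5      0.0      for i in range(len(motifs[0])):
     8         3            2      0.7      0.0          y = ''
     9   3000003      1502627      0.5     41.7          for j in range(len(motifs)):
    10   3000000      2075204      0.7     57.5              y += motifs[j][i]
    11         3           22      7.3      0.0          z.append(y)
    12         1            1      1.0      0.0      totalscore = 0
    13         4            4      1.0      0.0      for string in z:
    14         3        29489   9829.7      0.8          score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
    15         3            5      1.7      0.0          totalscore += score
    16         1            1      1.0      0.0      return totalscore
Total Time: 3.60737 s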
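Almost all of the time (41.7% + 57.5% = 99.2%) goes to the inner loop that builds each column one character at a time:
    y += motifs[j][i]
There is a much better way of transposing your strings though: the zip trick, zip(*motifs), which yields each column as a tuple. You can therefore rewrite your code as: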
from random import randrange

@profile  # injected by kernprof (line_profiler) when the script is run with "kernprof -l"
def scoreMotifs(motifs):
    '''This function computes the score of list of motifs'''
    # zip(*motifs) transposes the motifs: each item is one column, as a tuple.
    z = zip(*motifs)
    totalscore = 0
    for string in z:
        score = len(string)-max([string.count('A'),string.count('C'), string.count('G'), string.count('T')])
        totalscore += score
    return totalscore

def random_seq():
    dna_mapping = ['T', 'A', 'C', 'G']
    return ''.join([dna_mapping[randrange(4)] for _ in range(3)])

motifs = [random_seq() for _ in range(1000000)]
print(scoreMotifs(motifs))
motifs = ['GCG','AAG','AAG','ACG','CAA']
print(scoreMotifs(motifs))
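As a quick illustration of what the zip trick produces, here is a small sketch using the five example motifs from the question. The variable name `columns` is just for this demo.

```python
motifs = ['GCG', 'AAG', 'AAG', 'ACG', 'CAA']

# zip(*motifs) is the same as zip('GCG', 'AAG', 'AAG', 'ACG', 'CAA'):
# it yields one tuple per column position.
columns = list(zip(*motifs))  # list() so this also works where zip is lazy
print(columns)

# Tuples support .count(), so the scoring loop works unchanged on them.
print(columns[0].count('A'))
```

Each tuple holds the same characters as the strings 'GAAAC', 'CAACA' and 'GGGGA' that the original loop built, only without the repeated string concatenation.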
The total time for the rewritten version:
Total time: 0.61699 s
I'd say that is a pretty nice improvement: roughly 5.8 times faster than the original 3.60737 s.
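If you want something more compact, one possible variant combines the zip transposition with collections.Counter. This is only a sketch: the name `score_motifs_counter` is made up here, and I have not profiled it, so it may or may not be faster than the four `.count()` calls.

```python
from collections import Counter

def score_motifs_counter(motifs):
    """Sum, over columns, of (column length - count of the most common symbol)."""
    total = 0
    for column in zip(*motifs):
        # most_common(1) returns [(symbol, count)] for the top symbol.
        total += len(column) - Counter(column).most_common(1)[0][1]
    return total

print(score_motifs_counter(['GCG', 'AAG', 'AAG', 'ACG', 'CAA']))
```

Unlike the version above, this does not hard-code the alphabet 'A', 'C', 'G', 'T', so it counts whatever symbols appear in each column. Note that zip stops at the shortest string, so motifs of unequal length are silently truncated.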
Upvotes: 2