user2171391
user2171391

Reputation:

How to speed up Python string matching code

I have this code which computes the Longest Common Subsequence between random strings to see how accurately one can reconstruct an unknown region of the input. To get good statistics I need to iterate it many times but my current python implementation is far too slow. Even using pypy it currently takes 21 seconds to run once and I would ideally like to run it 100s of times.

#!/usr/bin/python

import random
import itertools
#test to see how many different unknowns are compatible with a set of LCS answers.
def lcs(x, y):
    n = len(x)
    m = len(y)
#    table is the dynamic programming table
    table = [list(itertools.repeat(0, n+1)) for _ in xrange(m+1)]
    for i in range(n+1):     # i=0,1,...,n
        for j in range(m+1):  # j=0,1,...,m
            if i == 0 or j == 0:
                table[i][j] = 0
            elif x[i-1] == y[j-1]:
                table[i][j] = table[i-1][j-1] + 1
            else:
                table[i][j] = max(table[i-1][j], table[i][j-1])

    # Now, table[n, m] is the length of LCS of x and y.
    return table[n][m]

def lcses(pattern, text):
    return [lcs(pattern, text[i:i+2*l]) for i in xrange(0,l)]

l = 15
#Create the pattern
pattern = [random.choice('01') for i in xrange(2*l)]

#create text start and end and unknown. 
start = [random.choice('01') for i in xrange(l)]
end = [random.choice('01') for i in xrange(l)]
unknown = [random.choice('01') for i in xrange(l)]

lcslist= lcses(pattern, start+unknown+end)

count = 0
for test in itertools.product('01',repeat = l):
    test=list(test)
    testlist = lcses(pattern, start+test+end)
    if (testlist == lcslist):
        count += 1

print count

I tried converting it to numpy but I must have done it badly as it actually ran more slowly. Can this code be sped up a lot somehow?

Update. Following a comment below, it would be better if lcses used a recurrence directly which gave the LCS between pattern and all sublists of text of the same length. Is it possible to modify the classic dynamic programming LCS algorithm somehow to do this?

Upvotes: 3

Views: 499

Answers (1)

msw
msw

Reputation: 43517

The recurrence table table is being recomputed 15 times on every call to lcses() when it is only dependent upon m and n where m has a maximum value of 2*l and n is at most 3*l.

If your program only computed table once, it would be dynamic programming which it is not currently. A Python idiom for this would be

table = None
def use_lcs_table(m, n, l):
    global table
    if table is None:
        table = lcs(2*l, 3*l)
    return table[m][n]

Except using an class instance would be cleaner and more extensible than a global table declaration. But this gives you an idea of why its taking so much time.

Added in reply to comment:

Dynamic Programming is an optimization that requires a trade-off of extra space for less time. In your example you appear to be doing a table pre-computation in lcs() but you build the whole list on every single call and then throw it away. I don't claim to understand the algorithm you are trying to implement, but the way you have it coded, it either:

  1. Has no recurrence relation, thus no grounds for DP optimization, or
  2. Has a recurrence relation, the implementation of which you bungled.

Upvotes: 1

Related Questions