Reputation: 23311
I've been playing around with the Boyer-Moore sting search algorithm and starting with a base code set from Shriphani Palakodety I created 2 additional versions (v2 and v3) - each making some modifications such as removing len() function from the loop and than refactoring the while/if conditions. From v1 to v2 I see about a 10%-15% improvement and from v1 to v3 a 25%-30% improvement (significant).
My question is: does anyone have any additional mods that would improve performance even more (if you can submit as a v4) - keeping the base 'algorithm' true to Boyer-Moore.
#!/usr/bin/env python
import time
bcs = {} #the table
def goodSuffixShift(key):
for i in range(len(key)-1, -1, -1):
if key[i] not in bcs.keys():
bcs[key[i]] = len(key)-i-1
#---------------------- v1 ----------------------
def searchv1(text, key):
"""base from Shriphani Palakodety fixed for single char"""
i = len(key)-1
index = len(key) -1
j = i
while True:
if i < 0:
return j + 1
elif j > len(text):
return "not found"
elif text[j] != key[i] and text[j] not in bcs.keys():
j += len(key)
i = index
elif text[j] != key[i] and text[j] in bcs.keys():
j += bcs[text[j]]
i = index
else:
j -= 1
i -= 1
#---------------------- v2 ----------------------
def searchv2(text, key):
"""removed string len functions from loop"""
len_text = len(text)
len_key = len(key)
i = len_key-1
index = len_key -1
j = i
while True:
if i < 0:
return j + 1
elif j > len_text:
return "not found"
elif text[j] != key[i] and text[j] not in bcs.keys():
j += len_key
i = index
elif text[j] != key[i] and text[j] in bcs.keys():
j += bcs[text[j]]
i = index
else:
j -= 1
i -= 1
#---------------------- v3 ----------------------
def searchv3(text, key):
"""from v2 plus modified 3rd if condition
breaking down the comparison for efficiency,
modified the while loop to include the first
if condition (opposite of it)
"""
len_text = len(text)
len_key = len(key)
i = len_key-1
index = len_key -1
j = i
while i >= 0 and j <= len_text:
if text[j] != key[i]:
if text[j] not in bcs.keys():
j += len_key
i = index
else:
j += bcs[text[j]]
i = index
else:
j -= 1
i -= 1
if j > len_text:
return "not found"
else:
return j + 1
key_list = ["POWER", "HOUSE", "COMP", "SCIENCE", "SHRIPHANI", "BRUAH", "A", "H"]
text = "SHRIPHANI IS A COMPUTER SCIENCE POWERHOUSE"
t1 = time.clock()
for key in key_list:
goodSuffixShift(key)
#print searchv1(text, key)
searchv1(text, key)
bcs = {}
t2 = time.clock()
print('v1 took %0.5f ms' % ((t2-t1)*1000.0))
t1 = time.clock()
for key in key_list:
goodSuffixShift(key)
#print searchv2(text, key)
searchv2(text, key)
bcs = {}
t2 = time.clock()
print('v2 took %0.5f ms' % ((t2-t1)*1000.0))
t1 = time.clock()
for key in key_list:
goodSuffixShift(key)
#print searchv3(text, key)
searchv3(text, key)
bcs = {}
t2 = time.clock()
print('v3 took %0.5f ms' % ((t2-t1)*1000.0))
Upvotes: 2
Views: 4196
Reputation: 82992
Using "in bcs.keys()" is creating a list and then doing an O(N) search of the list -- just use "in bcs".
Do the goodSuffixShift(key) thing inside the search function. Two benefits: the caller has only one API to use, and you avoid having bcs as a global (horrid ** 2).
Your indentation is incorrect in several places.
Update
This is not the Boyer-Moore algorithm (which uses TWO lookup tables). It looks more like the Boyer-Moore-Horspool algorithm, which uses only the first BM table.
A probable speedup: add the line 'bcsget = bcs.get' after setting up the bcs dict. Then replace:
if text[j] != key[i]:
if text[j] not in bcs.keys():
j += len_key
i = index
else:
j += bcs[text[j]]
i = index
with:
if text[j] != key[i]:
j += bcsget(text[j], len_key)
i = index
Update 2 -- back to basics, like getting the code correct before you optimise
Version 1 has some bugs which you have carried forward into versions 2 and 3. Some suggestions:
Change the not-found response from "not found" to -1. This makes it compatible with text.find(key), which you can use to check your results.
Get some more text values e.g. "R" * 20, "X" * 20, and "XXXSCIENCEYYY" for use with your existing key values.
Lash up a test harness, something like this:
func_list = [searchv1, searchv2, searchv3]
def test():
for text in text_list:
print '==== text is', repr(text)
for func in func_list:
for key in key_list:
try:
result = func(text, key)
except Exception, e:
print "EXCEPTION: %r expected:%d func:%s key:%r" % (e, expected, func.__name__, key)
continue
expected = text.find(key)
if result != expected:
print "ERROR actual:%d expected:%d func:%s key:%r" % (result, expected, func.__name__, key)
Run that, fix the errors in v1, carry those fixes forward, run the tests again until they're all OK. Then you can tidy up your timing harness along the same lines, and see what the performance is. Then you can report back here, and I'll give you my idea of what a searchv4 function should look like ;-)
Upvotes: 4