Various search algorithms and performance for full text matching

Question

Let's say I have the following string:

/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt

How would I determine the fastest way to search this string for a match given input terms ['sicario', '419']

For example, the most basic version would be:

1) String contains:

s = '/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt'
terms = ['sicario', '419']
has_match = all([term.lower() in s.lower() for term in terms])

2) Regex

has_match = all([re.compile(term, re.I).search(s) for term in terms])

Other possible options would be:

Rabin-Karp
Boyer-Moore
And other string-searching algorithms

What would be the time complexity of the various algorithms?

Here is what I got for timing regex vs str() in:

 import timeit

 # Case insensitive
setup =  's="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; terms = ["sicario", "419"]'
print min(timeit.Timer('all([term in s.lower() for term in terms])', setup=setup).repeat(7, 1000))
0.00134181976318

# Case sensitive
setup =  's="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; terms = ["sicario", "419"]'
print min(timeit.Timer('all([term in s for term in terms])', setup=setup).repeat(7, 1000))
0.000231027603149


# Regex case insensitive
setup =  'import re; s="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; compiled_terms = [re.compile("sicario", re.I), re.compile("419", re.I)]'
print min(timeit.Timer('all([compiled_term.search(s) for compiled_term in compiled_terms])', setup=setup).repeat(7, 1000))
0.00111889839172


# Regex case sensitive
setup =  'import re; s="/Volumes/01/LG_SICARIO_ES419SUB_16X9_240_2398_DIGITAL_FINAL.itt"; compiled_terms = [re.compile("sicario"), re.compile("419")]'
print min(timeit.Timer('all([compiled_term.search(s) for compiled_term in compiled_terms])', setup=setup).repeat(7, 1000))
0.000588893890381

It's pretty close, though case-sensitive string searches perform about 2x better than regex (at least on this input data).

kederrac · Accepted Answer

I think your basic version is the fastest (a combination of Boyer-Moore and Horspoo) available (sublinear search behaviour in good cases (O(n/m)), I will add small changes to your basic version:

def test():
    lower_s = s .lower()
    return all([term in lower_s for term in terms])

Various search algorithms and performance for full text matching

Answers (1)

Related Questions