Reputation: 45667
I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101
, 01010100100
, OR 10100100010
within a list of length 8 million. I only need to find the first occurrence of either, and then return the index at which it occurs. However, doing the looping and accesses over the large list can be expensive, and I'd rather not do it too many times.
Is there a faster method than doing
foreach (patterns) {
for (i=0; i < listLength; i++)
for(t=0; t < patternlength; t++)
if( list[i+t] != pattern[t] ) {
break;
}
if( t == patternlength - 1 ) {
return i; // pattern found!
}
}
}
}
Edit: BTW, I have implemented this program according to the above pseudocode, and performance is OK, but nothing spectacular. I'm estimating that I process about 6 million bits a second on a single core of my processor. I'm using this for image processing, and it's going to have to go through a few thousand 8 megapixel images, so every little bit helps.
Edit: If it's not clear, I'm working with a bit array, so there's only two possibilities: ONE and ZERO. And it's in C++.
Edit: Thanks for the pointers to BM and KMP algorithms. I noted that, on the Wikipedia page for BM, it says
The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly).
That looks interesting, but it didn't give any examples of such algorithms. Would something like that also help?
Upvotes: 4
Views: 4215
Reputation: 41775
The key for Googling is "multi-pattern" string matching.
Back in 1975, Aho and Corasick published a (linear-time) algorithm, which was used in the original version of fgrep
. The algorithm subsequently got refined by many researchers. For example, Commentz-Walter (1979) combined Aho&Corasick with Boyer&Moore matching. Baeza-Yates (1989) combined AC with the Boyer-Moore-Horspool variant. Wu and Manber (1994) did similar work.
An alternative to the AC line of multi-pattern matching algorithms is Rabin and Karp's algorithm.
I suggest to start with reading the Aho-Corasick and Rabin-Karp Wikipedia pages and then decide whether that would make sense in your case. If so, maybe there already is an implementation for your language/runtime available.
Upvotes: 12
Reputation: 49013
If it's just alternating 0's and 1's, then encode your text as runs. A run of n 0's is -n and a run of n 1's is n. Then encode your search strings. Then create a search function that uses the encoded strings.
The code looks like this:
try:
import psyco
psyco.full()
except ImportError:
pass
def encode(s):
def calc_count(count, c):
return count * (-1 if c == '0' else 1)
result = []
c = s[0]
count = 1
for i in range(1, len(s)):
d = s[i]
if d == c:
count += 1
else:
result.append(calc_count(count, c))
count = 1
c = d
result.append(calc_count(count, c))
return result
def search(encoded_source, targets):
def match(encoded_source, t, max_search_len, len_source):
x = len(t)-1
# Get the indexes of the longest segments and search them first
most_restrictive = [bb[0] for bb in sorted(((i, abs(t[i])) for i in range(1,x)), key=lambda x: x[1], reverse=True)]
# Align the signs of the source and target
index = (0 if encoded_source[0] * t[0] > 0 else 1)
unencoded_pos = sum(abs(c) for c in encoded_source[:index])
start_t, end_t = abs(t[0]), abs(t[x])
for i in range(index, len(encoded_source)-x, 2):
if all(t[j] == encoded_source[j+i] for j in most_restrictive):
encoded_start, encoded_end = abs(encoded_source[i]), abs(encoded_source[i+x])
if start_t <= encoded_start and end_t <= encoded_end:
return unencoded_pos + (abs(encoded_source[i]) - start_t)
unencoded_pos += abs(encoded_source[i]) + abs(encoded_source[i+1])
if unencoded_pos > max_search_len:
return len_source
return len_source
len_source = sum(abs(c) for c in encoded_source)
i, found, target_index = len_source, None, -1
for j, t in enumerate(targets):
x = match(encoded_source, t, i, len_source)
print "Match at: ", x
if x < i:
i, found, target_index = x, t, j
return (i, found, target_index)
if __name__ == "__main__":
import datetime
def make_source_text(len):
from random import randint
item_len = 8
item_count = 2**item_len
table = ["".join("1" if (j & (1 << i)) else "0" for i in reversed(range(item_len))) for j in range(item_count)]
return "".join(table[randint(0,item_count-1)] for _ in range(len//item_len))
targets = ['0001101101'*2, '01010100100'*2, '10100100010'*2]
encoded_targets = [encode(t) for t in targets]
data_len = 10*1000*1000
s = datetime.datetime.now()
source_text = make_source_text(data_len)
e = datetime.datetime.now()
print "Make source text(length %d): " % data_len, (e - s)
s = datetime.datetime.now()
encoded_source = encode(source_text)
e = datetime.datetime.now()
print "Encode source text: ", (e - s)
s = datetime.datetime.now()
(i, found, target_index) = search(encoded_source, encoded_targets)
print (i, found, target_index)
print "Target was: ", targets[target_index]
print "Source matched here: ", source_text[i:i+len(targets[target_index])]
e = datetime.datetime.now()
print "Search time: ", (e - s)
On a string twice as long as you offered, it takes about seven seconds to find the earliest match of three targets in 10 million characters. Of course, since I am using random text, that varies a bit with each run.
psyco is a python module for optimizing the code at run-time. Using it, you get great performance, and you might estimate that as an upper bound on the C/C++ performance. Here is recent performance:
Make source text(length 10000000): 0:00:02.277000
Encode source text: 0:00:00.329000
Match at: 2517905
Match at: 494990
Match at: 450986
(450986, [1, -1, 1, -2, 1, -3, 1, -1, 1, -1, 1, -2, 1, -3, 1, -1], 2)
Target was: 1010010001010100100010
Source matched here: 1010010001010100100010
Search time: 0:00:04.325000
It takes about 300 milliseconds to encode 10 million characters and about 4 seconds to search three encoded strings against it. I don't think the encoding time would be high in C/C++.
Upvotes: 0
Reputation: 45667
If it's a bit array, I suppose doing a rolling sum would be an improvement. If pattern is length n
, sum the first n
bits and see if it matches the pattern's sum. Store the first bit of the sum always. Then, for every next bit, subtract the first bit from the sum and add the next bit, and see if the sum matches the pattern's sum. That would save the linear loop over the pattern.
It seems like the BM algorithm isn't as awesome for this as it looks, because here I only have two possible values, zero and one, so the first table doesn't help a whole lot. Second table might help, but that means BMH is mostly worthless.
Edit: In my sleep-deprived state I couldn't understand BM, so I just implemented this rolling sum (it was really easy) and it made my search 3 times faster. Thanks to whoever mentioned "rolling hashes". I can now search through 321,750,000 bits for two 30-bit patterns in 5.45 seconds (and that's single-threaded), versus 17.3 seconds before.
Upvotes: 0
Reputation: 43120
A solution that could be efficient:
pattern_length
chars are in the trie, stop on success ( O(1) operation )If the list isn't mutable you can store the offset of matching patterns to avoid repeating calculations the next time.
Upvotes: 1
Reputation: 4143
If your strings need to be flexible, I would also recommend a modified "The Boyer–Moore string search algorithm" as per Mitch Wheat. If your strings do not need to be flexible, you should be able to collapse your pattern matching even more. The model of Boyer-Moore is incredibly efficient for searching a large amount of data for one of multiple strings to match against.
Jacob
Upvotes: 0
Reputation: 14849
You could Build an SuffixArray and search the runtime is crazy : O ( length(pattern) ). BUT .. you have to build that array. It's only worth .. when the Text is static and the pattern dynamic .
Upvotes: 2
Reputation: 300559
Yes.
The Boyer–Moore string search algorithm
See also: Knuth–Morris–Pratt algorithm
Upvotes: 5