Reputation: 4534
How can I efficiently check if a string is a substring of any of a given list of strings?
The naive way would be:
def is_substring_of_any(phrase, known_phrases):
    for known_phrase in known_phrases:
        if phrase in known_phrase:
            return True
    return False
However, it scales badly for a large number of strings:
known_phrases = [phrase_generator(15) for _ in tqdm(range(10 ** 6), desc="Generating known phrases")]
phrases_to_check = [phrase_generator(7) for _ in tqdm(range(10 ** 5), desc="Generating phrases to check")]
for phrase in tqdm(phrases_to_check):
    is_substring_of_any(phrase, known_phrases)
with the progress bar projecting more than an hour to process them all:
Generating known phrases: 100%|██████████| 1000000/1000000 [00:13<00:00, 73370.36it/s]
Generating phrases to check: 100%|██████████| 100000/100000 [00:00<00:00, 137534.76it/s]
6%|▌ | 5991/100000 [04:23<1:11:20, 21.96it/s]
Is there a way to run it faster without setting up additional infrastructure?
Upvotes: 0
Views: 151
Reputation: 516
Boyer–Moore string-search algorithm
The Boyer–Moore string-search algorithm is an efficient string-searching algorithm that serves as the standard benchmark in the practical string-search literature.
The Boyer–Moore algorithm searches for occurrences of $P$ in $T$ by performing explicit character comparisons at different alignments. Instead of a brute-force search of all $m - n + 1$ alignments (where $m$ is the length of $T$ and $n$ is the length of $P$), Boyer–Moore uses information gained by preprocessing $P$ to skip as many alignments as possible.
The key insight in this algorithm is that if the end of the pattern is compared to the text first, jumps along the text can be made rather than checking every character. When the pattern is lined up against the text, its last character is compared to the corresponding character in the text. If the characters do not match, there is no need to search backwards through the rest of the alignment. If the text character does not occur anywhere in the pattern, the next alignment to check is $n$ characters farther along the text, where $n$ is the length of the pattern. If the text character does occur in the pattern, a partial shift of the pattern along the text is made to line it up with that matching character, and the process is repeated. Jumping along the text to make comparisons rather than checking every character reduces the number of comparisons that have to be made, which is the key to the efficiency of the algorithm.
More formally, the algorithm begins at alignment $k = n$, so the start of $P$ is aligned with the start of $T$. Characters in $P$ and $T$ are then compared starting at index $n$ in $P$ and $k$ in $T$, moving backward, so the strings are matched from the end of $P$ to its start. The comparisons continue until either the beginning of $P$ is reached (which means there is a match) or a mismatch occurs, whereupon the alignment is shifted forward (to the right) by the maximum amount permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of $T$, at which point no further matches are possible.
The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of $P$.
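For intuition, here is a minimal sketch of the bad-character rule on its own, before the full implementation below (which also applies the strong good-suffix rule and Galil's rule). The names last_occurrence and bad_character_search are illustrative only:

def last_occurrence(pattern: str) -> dict:
    """Map each character of the pattern to the index of its rightmost occurrence."""
    return {c: i for i, c in enumerate(pattern)}

def bad_character_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1 (bad-character rule only)."""
    last = last_occurrence(pattern)
    n, m = len(pattern), len(text)
    k = 0  # alignment of the start of the pattern within the text
    while k <= m - n:
        i = n - 1
        while i >= 0 and pattern[i] == text[k + i]:  # compare right to left
            i -= 1
        if i < 0:
            return k  # every character of the pattern matched
        # Shift so the mismatched text character lines up with its rightmost
        # occurrence in the pattern, or jump past it if it never occurs.
        k += max(1, i - last.get(text[k + i], -1))
    return -1

This sketch skips fewer alignments than the full algorithm, but it shows the jumping idea described above.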
Python Implementation:
from typing import List

# This version is sensitive to the English alphabet in ASCII for case-insensitive matching.
# To remove this feature, define alphabet_index as ord(c), and replace instances of "26"
# with "256" or any maximum code-point you want. For Unicode you may want to match in UTF-8
# bytes instead of creating a 0x10FFFF-sized table.
ALPHABET_SIZE = 26

def alphabet_index(c: str) -> int:
    """Return the index of the given character in the English alphabet, counting from 0."""
    val = ord(c.lower()) - ord("a")
    assert val >= 0 and val < ALPHABET_SIZE
    return val
def match_length(S: str, idx1: int, idx2: int) -> int:
    """Return the length of the match of the substrings of S beginning at idx1 and idx2."""
    if idx1 == idx2:
        return len(S) - idx1
    match_count = 0
    while idx1 < len(S) and idx2 < len(S) and S[idx1] == S[idx2]:
        match_count += 1
        idx1 += 1
        idx2 += 1
    return match_count
def fundamental_preprocess(S: str) -> List[int]:
    """Return Z, the Fundamental Preprocessing of S.
    Z[i] is the length of the substring beginning at i which is also a prefix of S.
    This pre-processing is done in O(n) time, where n is the length of S.
    """
    if len(S) == 0:  # Handles case of empty string
        return []
    if len(S) == 1:  # Handles case of single-character string
        return [1]
    z = [0 for x in S]
    z[0] = len(S)
    z[1] = match_length(S, 0, 1)
    for i in range(2, 1 + z[1]):  # Optimization from exercise 1-5
        z[i] = z[1] - i + 1
    # Defines lower and upper limits of z-box
    l = 0
    r = 0
    for i in range(2 + z[1], len(S)):
        if i <= r:  # i falls within existing z-box
            k = i - l
            b = z[k]
            a = r - i + 1
            if b < a:  # b ends within existing z-box
                z[i] = b
            else:  # b ends at or after the end of the z-box, we need to do an explicit match to the right of the z-box
                z[i] = a + match_length(S, a, r + 1)
                l = i
                r = i + z[i] - 1
        else:  # i does not reside within existing z-box
            z[i] = match_length(S, 0, i)
            if z[i] > 0:
                l = i
                r = i + z[i] - 1
    return z
def bad_character_table(S: str) -> List[List[int]]:
    """
    Generates R for S, which is an array indexed by the position of some character c in the
    English alphabet. At that index in R is an array of length |S|+1, specifying for each
    index i in S (plus the index after S) the next location of character c encountered when
    traversing S from right to left starting at i. This is used for a constant-time lookup
    for the bad character rule in the Boyer-Moore string search algorithm, although it has
    a much larger size than non-constant-time solutions.
    """
    if len(S) == 0:
        return [[] for a in range(ALPHABET_SIZE)]
    R = [[-1] for a in range(ALPHABET_SIZE)]
    alpha = [-1 for a in range(ALPHABET_SIZE)]
    for i, c in enumerate(S):
        alpha[alphabet_index(c)] = i
        for j, a in enumerate(alpha):
            R[j].append(a)
    return R
def good_suffix_table(S: str) -> List[int]:
    """
    Generates L for S, an array used in the implementation of the strong good suffix rule.
    L[i] = k, the largest position in S such that S[i:] (the suffix of S starting at i) matches
    a suffix of S[:k] (a substring in S ending at k). Used in Boyer-Moore, L gives an amount to
    shift P relative to T such that no instances of P in T are skipped and a suffix of P[:L[i]]
    matches the substring of T matched by a suffix of P in the previous match attempt.
    Specifically, if the mismatch took place at position i-1 in P, the shift magnitude is given
    by the equation len(P) - L[i]. In the case that L[i] = -1, the full shift table is used.
    Since only proper suffixes matter, L[0] = -1.
    """
    L = [-1 for c in S]
    N = fundamental_preprocess(S[::-1])  # S[::-1] reverses S
    N.reverse()
    for j in range(0, len(S) - 1):
        i = len(S) - N[j]
        if i != len(S):
            L[i] = j
    return L
def full_shift_table(S: str) -> List[int]:
    """
    Generates F for S, an array used in a special case of the good suffix rule in the Boyer-Moore
    string search algorithm. F[i] is the length of the longest suffix of S[i:] that is also a
    prefix of S. In the cases it is used, the shift magnitude of the pattern P relative to the
    text T is len(P) - F[i] for a mismatch occurring at i-1.
    """
    F = [0 for c in S]
    Z = fundamental_preprocess(S)
    longest = 0
    for i, zv in enumerate(reversed(Z)):
        longest = max(zv, longest) if zv == i + 1 else longest
        F[-i - 1] = longest
    return F
def string_search(P: str, T: str) -> List[int]:
    """
    Implementation of the Boyer-Moore string search algorithm. This finds all occurrences of P
    in T, and incorporates numerous ways of pre-processing the pattern to determine the optimal
    amount to shift the string and skip comparisons. In practice it runs in O(m) (and even
    sublinear) time, where m is the length of T. This implementation performs a case-insensitive
    search on ASCII alphabetic characters, spaces not included.
    """
    if len(P) == 0 or len(T) == 0 or len(T) < len(P):
        return []

    matches = []

    # Preprocessing
    R = bad_character_table(P)
    L = good_suffix_table(P)
    F = full_shift_table(P)

    k = len(P) - 1      # Represents alignment of end of P relative to T
    previous_k = -1     # Represents alignment in previous phase (Galil's rule)
    while k < len(T):
        i = len(P) - 1  # Character to compare in P
        h = k           # Character to compare in T
        while i >= 0 and h > previous_k and P[i] == T[h]:  # Matches starting from end of P
            i -= 1
            h -= 1
        if i == -1 or h == previous_k:  # Match has been found (Galil's rule)
            matches.append(k - len(P) + 1)
            k += len(P) - F[1] if len(P) > 1 else 1
        else:  # No match, shift by max of bad character and good suffix rules
            char_shift = i - R[alphabet_index(T[h])][i]
            if i + 1 == len(P):  # Mismatch happened on first attempt
                suffix_shift = 1
            elif L[i + 1] == -1:  # Matched suffix does not appear anywhere in P
                suffix_shift = len(P) - F[i + 1]
            else:  # Matched suffix appears in P
                suffix_shift = len(P) - 1 - L[i + 1]
            shift = max(char_shift, suffix_shift)
            previous_k = k if shift >= i + 1 else previous_k  # Galil's rule
            k += shift

    return matches
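Applied to the question, one possible way to use this (a sketch, not a drop-in replacement) would be to swap the in check for string_search. Note that this implementation only accepts ASCII letters, since the assert in alphabet_index rejects spaces, digits and other characters, so it assumes phrase_generator produces purely alphabetic phrases:

def is_substring_of_any_bm(phrase: str, known_phrases: List[str]) -> bool:
    """Boyer-Moore variant of the naive check from the question."""
    # string_search returns a (possibly empty) list of match positions,
    # so any non-empty result means phrase occurs in that known phrase.
    return any(string_search(phrase, known_phrase) for known_phrase in known_phrases)

Since the pattern is the same for every known phrase, the preprocessing of phrase (R, L and F) could also be hoisted out of the loop if string_search were restructured to accept precomputed tables.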
For more information:
https://en.m.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm
Upvotes: 1