Reputation: 9372
I have a large collection of substrings. For example one million of strings like:
["abc", "bane", "cadb" ... "one", "mno" ... "zz", "zzz"]
I want to find which of these substrings are included into each of input strings. For example strings like:
"abcdefg hijklmnop" (contains "abc", "mno")
"something one, another two, and zzz" (contains "one", "zz", "zzz")
"однажды в студеную зимнюю пору" (contains nothing)
It is a search of millions of input strings against millions of substrings. It is important to find all matching substrings, even overlapping ones. What would be the most efficient approach for such a search?
I guess I could build a trigram database, find matching trigrams in input and substrings, and then use slow matching algorithm against resulting smaller dataset. But may be there is a better way?
Upvotes: 0
Views: 174
Reputation: 1177
Your problem is the origin of Aho-Corasick Algorithm, a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick, that matches simultaneously the occurrences of all patterns within the input text.
Its running time is O(n + m + M)
where n
is the sum of lengths of the patterns, m
is the length of input text and M
is the amount of matches.
If you find suitable to use a database as input data is huge take into consideration this linear time algorithm.
Upvotes: 3