What is the fastest method to find all substrings that are contained in a string?

Question

I have a large collection of substrings. For example one million of strings like:

["abc", "bane", "cadb" ... "one", "mno" ... "zz", "zzz"]

I want to find which of these substrings are included into each of input strings. For example strings like:

"abcdefg hijklmnop" (contains "abc", "mno")
"something one, another two, and zzz" (contains "one", "zz", "zzz")
"однажды в студеную зимнюю пору" (contains nothing)

It is a search of millions of input strings against millions of substrings. It is important to find all matching substrings, even overlapping ones. What would be the most efficient approach for such a search?

I guess I could build a trigram database, find matching trigrams in input and substrings, and then use slow matching algorithm against resulting smaller dataset. But may be there is a better way?

Eloy P&#233;rez Torres · Accepted Answer

Your problem is the origin of Aho-Corasick Algorithm, a string-searching algorithm invented by Alfred V. Aho and Margaret J. Corasick, that matches simultaneously the occurrences of all patterns within the input text.

Its running time is O(n + m + M) where n is the sum of lengths of the patterns, m is the length of input text and M is the amount of matches.

If you find suitable to use a database as input data is huge take into consideration this linear time algorithm.

What is the fastest method to find all substrings that are contained in a string?

Answers (1)

Related Questions