Algorithm to find common substring across N strings

Question

I'm familiar with LCS algorithms for 2 strings. I'm looking for suggestions for finding common substrings across 2..N strings. Each pair of strings may share several common substrings, and different subsets of the strings may share different common substrings.

strings: (ABCDEFGHIJKL) (DEF) (ABCDEF) (BIJKL) (FGH)

common strings:

1/2 (DEF)
1/3 (ABCDEF)
1/4 (IJKL)
1/5 (FGH)
2/3 (DEF)

longest common strings:

1/3 (ABCDEF)

most common strings:

1/2/3 (DEF)

Rex Kerr · Accepted Answer

This sort of thing is done all the time in DNA sequence analysis. You can find a variety of algorithms for it. One reasonable collection is listed here.

There's also the brute-force approach of tabulating every substring (practical only if you're interested in short ones): build a tree with one branch per possible character at each level (branching factor 26 for letters, 256 for ASCII), and store a count at every node. If you prune off little-used nodes (to keep the memory requirements reasonable), you end up with an algorithm that finds all substrings of length up to M in something like N*M^2*log(M) time for input of length N. If the input is instead K separate strings, you can record at each node which strings pass through it while building the tree, and then just read off the answer(s) in a single pass through the tree.
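A minimal sketch of that brute-force tree in Python, for illustration only. The names `build_trie` and `common_substrings`, the nested-dict node layout, and the `min_owners` parameter are assumptions of this sketch rather than anything from the answer, and it skips the pruning of little-used nodes during construction. Each node keeps the set of strings that contain the substring spelled by the path to it, and string numbers are reported 1-based to match the question.

```python
def build_trie(strings, max_len):
    """Insert every substring of length <= max_len from every string.

    Each node is a dict with 'children' (char -> node) and 'owners'
    (set of 0-based indices of the strings containing that substring).
    """
    root = {"children": {}, "owners": set()}
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            # Walk one path per starting position, at most max_len deep.
            for ch in s[start:start + max_len]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "owners": set()}
                )
                node["owners"].add(idx)
    return root


def common_substrings(strings, max_len, min_owners=2):
    """Map each substring found in >= min_owners strings to the sorted
    1-based numbers of the strings that contain it."""
    root = build_trie(strings, max_len)
    result = {}
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for ch, child in node["children"].items():
            # Extending a substring can only shrink its owner set, so a
            # branch below the threshold never needs to be explored.
            if len(child["owners"]) >= min_owners:
                sub = prefix + ch
                result[sub] = sorted(i + 1 for i in child["owners"])
                stack.append((sub, child))
    return result


strings = ["ABCDEFGHIJKL", "DEF", "ABCDEF", "BIJKL", "FGH"]
shared = common_substrings(strings, max_len=12)
print(shared["DEF"])         # [1, 2, 3]
print(max(shared, key=len))  # ABCDEF
```

The trie holds up to (total input length) x `max_len` nodes, so this is only reasonable for short strings or a small `max_len`; for long inputs, a generalized suffix tree or suffix array is the usual alternative.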
