Samantha

Reputation: 975

How to efficiently find identical substrings of a specified length in a collection of strings?

I have a collection S, typically containing 10-50 long strings. For illustrative purposes, suppose the length of each string ranges between 1000 and 10000 characters.

I would like to find strings of specified length k (typically in the range of 5 to 20) that are substrings of every string in S. This can obviously be done using a naive approach - enumerating every k-length substring in S[0] and checking if they exist in every other element of S.

Are there more efficient ways of approaching the problem? As far as I can tell, there are some similarities between this and the longest common subsequence problem, but my understanding of LCS is limited and I'm not sure how it could be adapted to the situation where we bound the desired common substring length to k, or if subsequence techniques can be applied to finding substrings.
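For reference, the naive approach described above can be sketched in Python (the sample strings here are hypothetical):

```python
def common_k_substrings_naive(S, k):
    """Return the k-length substrings of S[0] that occur in every string of S."""
    first = S[0]
    result = set()
    for i in range(len(first) - k + 1):
        sub = first[i:i + k]
        # check the candidate against every other string
        if all(sub in s for s in S[1:]):
            result.add(sub)
    return result

# Example with made-up data:
S = ["abcdefgh", "xxcdefyy", "cdefabab"]
print(common_k_substrings_naive(S, 4))  # {'cdef'}
```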

Upvotes: 6

Views: 1047

Answers (4)

AMieres

Reputation: 5014

I would try a simple method using HashSets:

  1. Build a HashSet for each long string in S with all its k-strings.
  2. Sort the sets by number of elements.
  3. Scan the first (smallest) set, looking each term up in the other sets.

The first step takes care of repetitions in each long string. The second minimizes the number of lookups by scanning the smallest set.

let getHashSet k (lstr:string) =
    let strs = System.Collections.Generic.HashSet<string>()
    for i in 0..lstr.Length - k do
        strs.Add lstr.[i..i + k - 1] |> ignore
    strs

let getCommons k lstrs =
    // Seq.cache prevents the HashSets from being rebuilt when the sorted
    // sequence is enumerated a second time by Seq.tail below
    let strss = lstrs |> Seq.map (getHashSet k) |> Seq.sortBy (fun strs -> strs.Count) |> Seq.cache
    match strss |> Seq.tryHead with
    | None   -> [||]
    | Some h ->
    let rest = Seq.tail strss |> Seq.toArray
    [|  for s in h do
            if rest |> Array.forall (fun strs -> strs.Contains s) then yield s
    |]

Test:

let random = System.Random System.DateTime.Now.Millisecond
let generateString n =
    [|  for i in 1..n do
            yield random.Next 20 |> (+) 65 |> System.Convert.ToByte
    |] |> System.Text.Encoding.ASCII.GetString


[ for i in 1..3 do yield generateString 10000 ]
|> getCommons 4
|> fun l -> printfn "found %d\n %A" l.Length l

result:

found 40
[|"PPTD"; "KLNN"; "FTSR"; "CNBM"; "SSHG"; "SHGO"; "LEHS"; "BBPD"; "LKQP"; "PFPH";
"AMMS"; "BEPC"; "HIPL"; "PGBJ"; "DDMJ"; "MQNO"; "SOBJ"; "GLAG"; "GBOC"; "NSDI";
"JDDL"; "OOJO"; "NETT"; "TAQN"; "DHME"; "AHDR"; "QHTS"; "TRQO"; "DHPM"; "HIMD";
"NHGH"; "EARK"; "ELNF"; "ADKE"; "DQCC"; "GKJA"; "ASME"; "KFGM"; "AMKE"; "JJLJ"|]

Here it is in fiddle: https://dotnetfiddle.net/ZK8DCT

Upvotes: 1

mcdowella

Reputation: 19621

I would treat each long string as a collection of overlapped short strings, so ABCDEFGHI becomes ABCDE, BCDEF, CDEFG, DEFGH, EFGHI. You can represent each short string as a pair of indexes, one specifying the long string and one the starting offset in that string (if this strikes you as naive, skip to the end).

I would then sort each collection into ascending order.

Now you can find the short strings common to the first two collections by merging the sorted lists of indexes, keeping only those from the first collection which are also present in the second collection. Check the survivors of this against the third collection, and so on; the survivors at the end correspond to those short strings which are present in all long strings.

(Alternatively you could maintain a pointer into each sorted list and repeatedly check whether every pointer points at short strings with the same text, then advance the pointer which points at the smallest short string.)
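A sketch of the sort-and-merge idea in Python (using the substrings themselves rather than (string, offset) index pairs, for brevity):

```python
def common_k_substrings_sorted(S, k):
    """Keep a sorted list of surviving k-substrings; intersect it with each
    string's sorted k-substring list via a standard two-pointer merge."""
    def sorted_subs(s):
        # deduplicated, lexicographically sorted k-length substrings of s
        return sorted({s[i:i + k] for i in range(len(s) - k + 1)})

    survivors = sorted_subs(S[0])
    for s in S[1:]:
        other = sorted_subs(s)
        merged, i, j = [], 0, 0
        while i < len(survivors) and j < len(other):
            if survivors[i] == other[j]:
                merged.append(survivors[i])
                i += 1
                j += 1
            elif survivors[i] < other[j]:
                i += 1
            else:
                j += 1
        survivors = merged           # only substrings seen so far in every string
    return survivors
```

The merge step is where the sortedness pays off: each intersection is linear in the two list lengths, so the initial sorts dominate.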

Time is O(n log n) for the initial sort, which dominates. In the worst case - e.g. when every string is AAAAAAAA..AA - there is a factor of k on top of this, because all string compares check all characters and take time k. Hopefully, there is a clever way round this using a https://en.wikipedia.org/wiki/Suffix_array, which allows you to sort in time O(n) rather than O(nk log n), and the https://en.wikipedia.org/wiki/LCP_array, which should allow you to skip some characters when comparing substrings from different suffix arrays.

Thinking about this again, I think the usual suffix array trick of concatenating all of the strings in question, separated by a character not found in any of them, works here. If you look at the LCP of the resulting suffix array you can split it into sections, splitting at points where the difference between suffixes occurs less than k characters in. Now each offset in any particular section starts with the same k characters. Now look at the offsets in each section and check to see if there is at least one offset from every possible starting string. If so, this k-character sequence occurs in all starting strings, but not otherwise. (There are suffix array constructions which work with arbitrarily large alphabets so you can always expand your alphabet to produce a character not in any string, if necessary.)
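A sketch of the concatenation idea in Python, using a naive quadratic suffix sort for brevity (a real implementation would use a linear-time suffix array construction and the LCP array to find section boundaries). The separator characters are low control codes, assumed absent from the input:

```python
def common_k_substrings_sa(S, k):
    """Concatenate all strings with unique separators, sort the suffixes,
    split the suffix array into sections sharing their first k characters,
    and keep sections that contain an offset from every input string."""
    text_parts, owner = [], []
    for idx, s in enumerate(S):
        text_parts.append(s)
        owner.extend([idx] * len(s))
        text_parts.append(chr(1 + idx))   # unique separator, assumed not in input
        owner.append(-1)
    text = "".join(text_parts)
    n = len(text)

    # Naive O(n^2 log n) suffix sort -- fine for a sketch only.
    sa = sorted(range(n), key=lambda p: text[p:])

    # A suffix contributes a k-substring only if its first k characters
    # lie entirely inside one input string (no separator among them).
    valid = [p for p in sa
             if owner[p] != -1 and p + k <= n and owner[p + k - 1] == owner[p]]

    # Group consecutive suffixes sharing their first k characters
    # (equivalent to splitting where the LCP of adjacent suffixes is < k).
    result, i = [], 0
    while i < len(valid):
        prefix = text[valid[i]:valid[i] + k]
        seen, j = set(), i
        while j < len(valid) and text[valid[j]:valid[j] + k] == prefix:
            seen.add(owner[valid[j]])
            j += 1
        if len(seen) == len(S):           # every string appears in this section
            result.append(prefix)
        i = j
    return result
```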

Upvotes: 1

rici

Reputation: 241931

Here's one fairly simple algorithm, which should be reasonably fast.

  1. Using a rolling hash as in the Rabin-Karp string search algorithm, construct a hash table H0 of all the |S0|-k+1 length k substrings of S0. That's roughly O(|S0|) since each hash is computed in O(1) from the previous hash, but it will take longer if there are collisions or duplicate substrings. Using a better hash will help you with collisions but if there are a lot of k-length duplicate substrings in S0 then you could end up using O(k|S0|).

  2. Now use the same rolling hash on S1. This time, look each substring up in H0 and if you find it, remove it from H0 and insert it into a new table H1. Again, this should be around O(|S1|) unless you have some pathological case, like both S0 and S1 are just long repetitions of the same character. (It's also going to be suboptimal if S0 and S1 are the same string, or have lots of overlapping pieces.)

  3. Repeat step 2 for each Si, each time creating a new hash table. (At the end of each iteration of step 2, you can delete the hash table from the previous step.)

At the end, the last hash table will contain all the common k-length substrings.
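A sketch of this successive-filtering scheme in Python, with the language's built-in hashed `set` standing in for the Rabin-Karp rolling-hash table:

```python
def common_k_substrings_filter(S, k):
    """After processing S[i], `table` holds exactly the k-substrings
    common to S[0]..S[i]; the last table is the answer."""
    # Step 1: all k-substrings of S[0] (duplicates collapse automatically).
    table = {S[0][i:i + k] for i in range(len(S[0]) - k + 1)}
    for s in S[1:]:
        # Steps 2-3: move each substring found in the old table to a new one.
        next_table = set()
        for i in range(len(s) - k + 1):
            sub = s[i:i + k]
            if sub in table:
                table.discard(sub)   # move rather than copy, as in step 2
                next_table.add(sub)
        table = next_table           # the previous table can be dropped
    return table
```

Removing a matched substring from the old table mirrors step 2 above and means repeated occurrences within one string are handled only once.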

The total run time should be about O(Σ|Si|) but in the worst case it could be O(kΣ|Si|). Even so, with the problem size as described, it should run in acceptable time.

Upvotes: 3

MBo

Reputation: 80327

Some thoughts (N is the number of strings, M is the average length, K is the required substring size):

Approach 1:

Walk through all strings, computing a rolling hash for each k-length substring and storing these hashes in a map (store tuples {key: hash; string_num; position}).

time O(NxM), space O(NxM)

Extract groups with equal hash, check step-by-step:
1) the size of the group is at least the number of strings
2) all strings are represented in the group
3) thorough checking of the real substrings for equality (hashes of distinct substrings may coincide)
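A sketch of approach 1 in Python, with an explicit polynomial rolling hash (the base and modulus here are illustrative choices, not prescribed by the answer):

```python
from collections import defaultdict

def common_k_substrings_rolling(S, k, base=257, mod=(1 << 61) - 1):
    """Group k-substring occurrences of all strings by rolling hash,
    then verify the real substrings (distinct substrings can collide)."""
    groups = defaultdict(list)            # hash -> [(string_num, position)]
    top = pow(base, k - 1, mod)           # weight of the outgoing character
    for num, s in enumerate(S):
        if len(s) < k:
            return set()
        h = 0
        for c in s[:k]:                   # hash of the first window
            h = (h * base + ord(c)) % mod
        groups[h].append((num, 0))
        for i in range(1, len(s) - k + 1):
            # slide the window: drop s[i-1], append s[i+k-1]
            h = ((h - ord(s[i - 1]) * top) * base + ord(s[i + k - 1])) % mod
            groups[h].append((num, i))

    result = set()
    for occurrences in groups.values():
        # check 1: the group must be at least as large as the number of strings
        if len(occurrences) < len(S):
            continue
        # checks 2 and 3: every string represented, real substrings equal
        subs = defaultdict(set)           # substring -> set of string_nums
        for num, pos in occurrences:
            subs[S[num][pos:pos + k]].add(num)
        for sub, nums in subs.items():
            if len(nums) == len(S):
                result.add(sub)
    return result
```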

Approach 2:

Build suffix array for every string

time O(N x MlogM) space O(N x M)

Find the intersection of the suffix arrays for the first pair of strings, using a merge-like approach (the suffixes are sorted), comparing only the first k characters of each suffix; then continue with the next string, and so on.

Upvotes: 1
