ResVict
ResVict

Reputation: 332

Set of all common substrings in 'k' or more strings out of 'n' strings

Iam looking for an algorithm to find a set of common substrings in at least 'k' number of strings out of 'n' number of total strings.

For example, if I have 5 strings then n=5 and I want to find all the substrings which are common in any 3 or more strings, then k=3.

The output should be a set of substrings of any length, which are common in at least 3 strings.

So

  1. abcdx
  2. efgh
  3. xyabcz
  4. ijklxmno
  5. pqrstbcduvwxabcd

should give something like abc, ab, bc, b, c, a, x

The problem with searching internet is that I do not know exact key words for such a problem. The basic searches results in Longest Common Substring, which may be related but not exactly what I am looking for.

Is my problem, a known problem with an established solution published in literature? Any pointers to proper keywords or reference to an article?

Or I will have to specify it as my own problem and create an algorithm for it? The very basic idea in my mind is to

  1. Break all the strings to their substrings and store them in their corresponding set.
  2. Take all the 'n' sets and take their Intersection as Result.
  3. For all the combinations of n-1 sets Choose a combination and take their Intersection and Union it with Result.
  4. Repeat step 3 for n-1,n-2,n-3,...,k size of combinations

The Result will have all the common substrings which are common in at least k strings. I think this will work, but if there is some clever way of doing the same thing then I would like to use and reference that.

In particular I am not after any language specific implementation. I am just looking for the algorithm, or any references to such a problem and its solutions in the literature.

Upvotes: 0

Views: 216

Answers (1)

kraskevich
kraskevich

Reputation: 18566

You can do it this way:

  1. Create a map from string to number of occurrences(let's call it cnt).

  2. For each given string, do the following:

    • Create a set of all substrings S.

    • For each string in S, add one to the number of occurrences of this string in cnt.

  3. Choose all entries from the cnt map which have number of occurrences >= k.

Some pseudo code:

cnt = an empty map
for string <- strings
    for substr <- set of substrings of the string
        cnt[substr]++
for entry <- cnt
    if entry.value >= k
        print entry.key

Upvotes: 0

Related Questions