Reputation: 332
Iam looking for an algorithm to find a set of common substrings in at least 'k' number of strings out of 'n' number of total strings.
For example, if I have 5 strings then n=5 and I want to find all the substrings which are common in any 3 or more strings, then k=3.
The output should be a set of substrings of any length, which are common in at least 3 strings.
So
should give something like abc, ab, bc, b, c, a, x
The problem with searching internet is that I do not know exact key words for such a problem. The basic searches results in Longest Common Substring, which may be related but not exactly what I am looking for.
Is my problem, a known problem with an established solution published in literature? Any pointers to proper keywords or reference to an article?
Or I will have to specify it as my own problem and create an algorithm for it? The very basic idea in my mind is to
The Result will have all the common substrings which are common in at least k strings. I think this will work, but if there is some clever way of doing the same thing then I would like to use and reference that.
In particular I am not after any language specific implementation. I am just looking for the algorithm, or any references to such a problem and its solutions in the literature.
Upvotes: 0
Views: 216
Reputation: 18566
You can do it this way:
Create a map from string to number of occurrences(let's call it cnt
).
For each given string, do the following:
Create a set of all substrings S
.
For each string in S
, add one to the number of occurrences of this string in cnt
.
Choose all entries from the cnt
map which have number of occurrences >= k
.
Some pseudo code:
cnt = an empty map
for string <- strings
for substr <- set of substrings of the string
cnt[substr]++
for entry <- cnt
if entry.value >= k
print entry.key
Upvotes: 0