recursion.ninja
recursion.ninja

Reputation: 5488

Find the dominant cyclic substring of a given string

I am looking for an algorithm to find the dominant cyclic substring of a given string.

A cyclic substring:

A dominant cyclic substring:

Example 1:

Example 2:

Example 3:

Example 4:

Exampe 5:

What is the best approach for implementing this algorithm?

Upvotes: 2

Views: 2239

Answers (4)

Floris
Floris

Reputation: 46425

Here is a possible approach (made simpler by the fact that your cycles have to be adjacent). Pick a string. See if it repeats. Keep track of the most repeated one.

EDIT actual tested Python code:

testStrings =[ "prefixgarbagecyclecyclecyclecyclesufixgarbage",
               "prefixgarbagecyclepadinggarbagecyclesufixgarbage",
               "prefixgarbagecyclecyclepadinggarbageprefixgarbage",
               "prefixgarbagecyclecyclecycleroundroundroundprefixgarbage",
               "abcdefghijklmnopqrstuvqxyz"];

for input in testStrings:

 repCountMax = 0
 longestCycle = ""
 repCount = 0
 for i in range (1, len(input)):
  for j in range( i+1, len(input)):
    substring = input[i:j]
    #print substring
    ls = len(substring)
    repCount = 1
    k = j
    while(substring == input[k:k+ls]):
      k = k + ls
      repCount = repCount +1
      #print "repetition ", repCount, " of ", substring, "\n"
    if (repCount > repCountMax) or ((repCount == repCountMax) and len(substring) > len(bestCycle)):
      repCountMax = repCount
      bestCycle = substring

 if repCountMax > 1:
  print "best cycle in '", input, "' is '", bestCycle,"' which is repeated ", repCountMax, " times."
 else:
  print "no repeated cycles found in string ", input

Resulting output:

best cycle in ' prefixgarbagecyclecyclecyclecyclesufixgarbage ' is ' ecycl ' which is repeated 4 times.

best cycle in ' prefixgarbagecyclepadinggarbagecyclesufixgarbage ' is ' g ' which is repeated 2 times.

best cycle in ' prefixgarbagecyclecyclepadinggarbageprefixgarbage ' is ' ecycl ' which is repeated 2 times.

best cycle in ' prefixgarbagecyclecyclecycleroundroundroundprefixgarbage ' is 'ecycl ' which is repeated 3 times.

no repeated cycles found in string abcdefghijklmnopqrstuvqxyz

Note - the cycle found was ecycl, not cycle. ecycl occurred first...

Second note - you could make things marginally more efficient by stopping when you can no longer "beat" the current best estimate - for example, if you have found five repeats already, and given the size of the string you are searching for there isn't space for six repeats. This will give a speed improvement when there is a significant number of repeats. See Evgeny's solution for a way to implement that.

Upvotes: 1

Evgeny Kluev
Evgeny Kluev

Reputation: 24667

Cannot find anything better than this quadratic-time algorithm (implemented in Python):

IREP, IPER, IPOS = 0, 1, 2

def find_dominant(src):
  best = (0, 0, 0) # repetitions-1, period, position

  period = 0
  while period < len(src) // max(2, 1 + best[IREP]):
    period += 1
    length = 0

    for pos in range(len(src) - 1 - period, -1, -1):
      if src[pos] == src[pos + period]:
        length += 1
        repetitions = length // period
        if repetitions >= best[IREP]:
          best = (repetitions, period, pos)
      else:
        length = 0

  return best


s = "prefixgarbagecyclecyclecyclecyclesufixgarbage"
res = find_dominant(s)
if res[0] == 0:
  print("nothing found")
else:
  print(res[IREP] + 1, '*', s[res[IPOS]: res[IPOS] + res[IPER]])

For each possible period scan the string and remember the longest periodic subsequence. Scan it backwards to check less conditions. Stop increasing period when no further improvement could be found.

Time complexity is O(N2 / R), where R is the number of repetitions of dominant substring. Space complexity is O(1).

Upvotes: 2

Lee Meador
Lee Meador

Reputation: 12985

Scan from left to right.

You need some key/value pairs. The key is a letter. The value includes the index of the last instance of the letter found to that point in the scan AND info about any cycle the letter belongs to with two or more strings (the last one beginning with that letter in that column).

You need a place to store info about any cycles found. Call that the "cycle store."

As you scan do this at each index:

  • Use the letter there. See if its in the table of keys.
  • If it is found, look ahead to see if the following letters match the letters between the one found in the table (the previous occurrence) and this letter (at the current scan index).
  • If they match, we have a cycle, see if the previous occurrence has info showing this is a continuation of the stored cycle info. (Note: Adjacent letters may be a special case.)
  • if it is a continuation,
    • add to the stored cycle info to include these chars
    • update the index of the last found occurrence of this letter
    • update info about this cycle in the cycle store
  • if it is not a continuation
    • create (or replace) the stored cycle info to show this new cycle (count=2)
    • update the index of the last found occurrence of this letter
    • add info about this cycle in the cycle store
  • if the letters after this occurrence don't match the letters after that occurrence,
    • remove any stored cycle info for this letter
    • update the index of the last found occurrence of this letter
  • if the letter isn't in the table, add a key/value pair for this letter and this index.

When you get done scanning, look through the cycle store to see which one is dominant.

Note that you probably don't have to store all the cycles until the end but its not obvious to me right off how to decide which ones you can throw away. Probably something about keeping the ones that are still possibly continuing based on the content of the key/value pair table plus the dominant one so far.

Upvotes: 1

Abhishek Bansal
Abhishek Bansal

Reputation: 12705

I think a modified form of KMP algorithm can be applied here.

Traverse the string from the beginning.

Take the 1st letter, store it. Take the next letter, if its the same, then there are 2 candidates for repeated cycle, else add this to the existing string.

Basically at each letter you will have a list of possible substrings for repitition possibilities. Ofcourse you will have to keep track of the length and max no. Of repititions till that point.

This can be done in O (n) time if I am not mistaken.

Upvotes: 0

Related Questions