Mason Wheeler
Mason Wheeler

Reputation: 84540

How do I find the largest sequence in a string that is repeated at least once?

Trying to solve the following problem:

Given a string of arbitrary length, find the longest substring that occurs more than one time within the string, with no overlaps.

For example, if the input string was ABCABCAB, the correct output would be ABC. You couldn't say ABCAB, because that only occurs twice where the two substrings overlap, which is not allowed.

Is there any way to solve this reasonably quickly for strings containing a few thousand characters?

(And before anyone asks, this is not homework. I'm looking at ways to optimize the rendering of Lindenmayer fractals, because they tend to take excessive amounts of time to draw at high iteration levels with a naive turtle graphics system.)

Upvotes: 9

Views: 4365

Answers (4)

morishuz
morishuz

Reputation: 2402

This type of analysis is often done in genome sequences. have a look at this paper. it has an efficient implemention (c++) for solving repeats: http://www.complex-systems.com/pdf/17-4-4.pdf might be what you are looking for

Upvotes: 0

piotrek
piotrek

Reputation: 14520

it looks like a suffix tree problem. Create the suffix tree, then find the biggest compressed branch with more than one child (occurs more than once in the original string). The number of letters in that compressed branch should be the size of the biggest subsequence.

i found something similar here: http://www.coderanch.com/t/370396/java/java/Algorithm-wanted-longest-repeating-substring

Looks like it can be done in O(n).

Upvotes: 2

Roman Dzhabarov
Roman Dzhabarov

Reputation: 521

First we need to define the start symbol of our substring and define the length. Iterate all possible start positions then figure out the length doing binary search for the length (if you can find substr with lenght a, you may find with the longer length, function looks monotonous so bin search should be fine). Then find equal substring is N, using KMP or Rabin-Karp any linear algo is fine. Total N*N*log(N). Is that too much complexity? The code is something like:

for(int i=0;i<input.length();++i)
    {
        int l = i;
        int r = input.length();
        while(l <= r)
        {
            int middle = l + ((r - l) >> 1);
            Check if string [i;middle] can be found in initial string. Should be done in O(n); You need to check parts of initial string [0,i-1], [middle+1;length()-1];
            if (found)
                l = middle + 1;
            else
                r = middle - 1;
        }
    }

Make sense?

Upvotes: 0

Jim Garrison
Jim Garrison

Reputation: 86764

Here's an example for a string of length 11, which you can generalize

  • Set chunk length to floor(11/2) = 5

  • Scan the string in chunks of 5 characters left to looking for repeats. There will be 3 comparisons

    Left      Right
    Offset    Offset
     0         5
     0         6
     1         5
  • If you found a duplicate you're done. Otherwise reduce the chunk length to 4 and repeat until chunk length goes to zero.

Here's some (obviously untested) pseudocode:

String s
int len = floor(s.length/2)
for int i=len; i>0; i--
    for j=0; j<=len-(2*i); j++
        for k=j+i; k<=len-i; k++
            if s.substr(j,j+i) == s.substr(k,k+i)
                return s.substr(j,j+i)
return null

There may be an off-by-one error in there, but the approach should be sound (and minimal).

Upvotes: 3

Related Questions