n00bc0d3r
n00bc0d3r

Reputation: 275

Finding substrings of string such that product of the length of the substring with its number of occurrences is maximized

I was thinking of the following problem: Given a string S, let the length of the ith substring be li and number of occurrences of the ith substring be oi. Print the substring such that li*oi is maximized.

I have O(n3) solution (brute force) for this problem where I am generating all the substrings and finding the substring with maximum value. My code for the same is as follows:

public static void solve(String S) {
    long max = Integer.MIN_VALUE;
    String res = "";
    for (int i = 0; i < S.length(); i++) {
        for (int j = 1; j <= S.length() - i; j++) {
            String s = S.substring(i, i + j);
            int o = countOccurrences(S, s);
            long p = (long) o * (long) s.length();
            if (max < p) {
                max = p;
                res = s;
            }
        }
    }
    System.out.println(res);
}

where countOccurrences() method takes O(n) time. I was wondering if there was a more efficient way to achieve this.

Upvotes: 5

Views: 836

Answers (4)

Vikram Bhat
Vikram Bhat

Reputation: 6246

Improved algorithm will be as follows :- (Psuedo code)

int max = 0;

 for(int i=0;i<str.length;) {

   occurences = get_index(str[i]);

   for(int j=i+1;j<str.length;j++) {

      if(occurences.size()==0)
         break;

      for(int k=0;k<occurences.size();k++) {

           int next_ind = occurrences[k]+j-i;

           if(next_ind==length||str[next_ind]!=str[j]) {
               delete(occurences[k]);
           }
       }

       int score = ocurrences.size()*(j-i+1);

       if(max<score)
         score = max; 
   } 

 }

Time Complexity :-

Ideally the occurrences of the substring should decrease like n,n/2,n/3.... at max so hence the algorithm is n*n*(1+1/2+1/3...) = O(n^2*logn) as (1+1/2+1/3...1/n) => logn. The occurence deletion can be done easily in O(1) using linked list.

Upvotes: 0

chrisb2244
chrisb2244

Reputation: 3001

Presumably you can calculate the maximum available 'score' with what you have left, and thus avoid checking for substrings beyond a certain point. I'd guess it still scales badly, but it artificially reduces your 'n' value.

For example, if you have a string of length 10, and you manage to get a 3 character substring twice in the first 3 characters (ie 0-2, 3-5), then you can say max = 6. At that point, once you move beyond i=4, you can't score more than 6, so you can stop?

Upvotes: 0

j_random_hacker
j_random_hacker

Reputation: 51246

Here's a linear-time algorithm:

  1. Build a suffix tree on the input string. This takes O(n) time and space.
  2. Traverse the suffix tree in postorder DFS, calculating the number of descendants for each node by summing the values of its children. As soon as this quantity is known for a node, multiply it with its string length (which is the sum of the length of all edges from the root) and update the best-so-far total if necessary. This also takes O(n) time.

The key points are that

  • A suffix tree contains only a linear number of internal nodes, and
  • Any substring that does not correspond to an internal node cannot produce a maximal score. This is because as you trace it from the suffix tree root it must reach "partway down" some edge -- but you can always extend it further without reducing the number of occurrences (which is the number of descendants), and thus increase the score, by continuing on down to the next node.

It might also be possible to do this using suffix arrays instead of suffix trees, in which case it's likely to require a constant factor less memory, but add a logarithmic factor to the running time.

Upvotes: 2

singe3
singe3

Reputation: 2105

An idea : You could shorten a bit the algorithm by creating an occurence list. This way before running countOccurences you would check if it hasn't been already counted for that occurence and you wouldn't run countOccurences again for that same occurence.

This is a first optimization, there should be other ones but it's a start.

Upvotes: 1

Related Questions