Jhowa

Reputation: 80

What is the minimum space complexity for finding the longest common substring?

#include <stdlib.h>

const char *_longest_common_substr(const char *s1, const char *s2, int n, int m) {
   // assert n >= m
   int max = 0; // keep track of max length found
   int tmp = 0; // value is incremented till letters match
   int begin = 0;
   int try_begin = 0; // possible new begin index
   int end = 0; // rv = s2[begin : end - 1]
   
   for (int i = 0; i < n; i++) {
       tmp = 0;
       try_begin = 0;
       // s1 is scanned circularly
       for (int j = 0; j < m; j++) {
           int index = (j + i) % n; // current s1 index
       if (s1[index] == s2[j]) { // index < n always holds after the modulo
               if (tmp == 0)
                   try_begin = j;
               tmp++;
           } else {
               tmp = 0;
           }
           if (tmp > max) {
               max = tmp;
               begin = try_begin;
               end = j + 1;
           }
       }
   }
       
   int size = begin >= end ? 0 : end - begin;
   char *rv = malloc((size + 1) * sizeof(*rv));
   int i;
   for (i = 0; i < size; i++)
       rv[i] = s2[begin + i];
   rv[i] = '\0';
   return rv;
}

const char *longest_common_substr(const char *s1, const char *s2, int n, int m) {
   if (n < m)
       return _longest_common_substr(s2, s1, m, n);
   return _longest_common_substr(s1, s2, n, m);
}

Is this code correct for finding the longest common substring? I don't understand why many sources, for example Wikipedia, use a matrix to solve the problem, while this apparently simple solution needs none: its time complexity is still O(n*m) and its space complexity is O(1). A possible test is

#include <stdio.h>
#include <string.h>

int main() {
    const char *str1 = "identification";
    const char *str2 = "administration";
    
    printf("%s\n", longest_common_substr(str1, str2, strlen(str1), strlen(str2)));
    
}

and the output is

ation

The substring is intended in a circular way, so with the input

action
tionac

the output will be

tionac

Anyway, I could add two different invalid characters at the end of the strings to remove this property.

Upvotes: 2

Views: 147

Answers (2)

chqrlie

Reputation: 144695

Your algorithm performs circular matching, which is different from the classic LCS algorithm. Yet it can be simplified to remove this extra matching and also remove the need to make s1 the longer string:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *longest_common_substr(const char *s1, const char *s2, int n1, int n2) {
    int begin = 0;  // offset of lcs in s2
    int max = 0;    // length of lcs

    for (int i = 0; i < n1; i++) {
        int tmp = 0;        // count of matched letters
        int index = i;      // current s1 index
        for (int j = 0; j < n2; j++, index++) {
            if (index == n1) {
                index = 0;
                tmp = 0;    // do not match circularly
            }
            if (s1[index] == s2[j]) {
                tmp++;
                if (tmp > max) {
                    max = tmp;
                    begin = j - (tmp - 1);
                }
            } else {
                tmp = 0;
            }
        }
    }

    char *rv = malloc(sizeof(*rv) * (max + 1));
    memcpy(rv, s2 + begin, max);
    rv[max] = '\0';
    return rv;
}

int main(int argc, char *argv[]) {
    if (argc == 3) {
        const char *s1 = argv[1];
        const char *s2 = argv[2];
        char *p = longest_common_substr(s1, s2, strlen(s1), strlen(s2));
        if (p) {
            printf("lcs('%s','%s') -> '%s'\n", s1, s2, p);
        }
    }
    return 0;
}

The above does have a time complexity of O(n1*n2) and a space complexity of O(1). Achieving better time complexity is usually a desired goal, which may explain why the Wikipedia article has a suboptimal example for the more complex case.

Upvotes: 1

bolov

Reputation: 75688

Your algorithm is a weird one. You count circular substrings on the longer string only, and if the strings are of equal length, only on the first one.

For instance:

  • for "bxa" and "ab" you find the solution "ab" because you count circular substrings on bxa

  • but for "bxa" and "zaby" you don't consider circular substrings in "bxa" and you don't find "ab" as a solution

  • ("abx", "bya") and ("bya", "abx") have different solutions even though only the position of the arguments changes. (you count circular substrings only on the first argument if both arguments are of equal length).

You must either:

  • not count circular substrings at all (the classic algorithm as described in the wiki link) or:

  • count circular substrings for both strings (but that is a different algorithm than longest common substring)

So the complexity of your algorithm is not comparable with the wiki algorithm because it is a different algorithm solving a different problem.

Upvotes: 2
