Reputation: 80
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

const char *_longest_common_substr(const char *s1, const char *s2, int n, int m) {
    // assert n >= m
    int max = 0;       // keep track of max length found
    int tmp = 0;       // value is incremented while letters match
    int begin = 0;
    int try_begin = 0; // possible new begin index
    int end = 0;       // rv = s2[begin : end - 1]
    for (int i = 0; i < n; i++) {
        tmp = 0;
        try_begin = 0;
        // s1 is scanned circularly
        for (int j = 0; j < m; j++) {
            int index = (j + i) % n; // current s1 index
            if (index < n && s1[index] == s2[j]) {
                if (tmp == 0)
                    try_begin = j;
                tmp++;
            } else {
                tmp = 0;
            }
            if (tmp > max) {
                max = tmp;
                begin = try_begin;
                end = j + 1;
            }
        }
    }
    int size = begin >= end ? 0 : end - begin;
    char *rv = malloc((size + 1) * sizeof(*rv));
    int i;
    for (i = 0; i < size; i++)
        rv[i] = s2[begin + i];
    rv[i] = '\0';
    return rv;
}

const char *longest_common_substr(const char *s1, const char *s2, int n, int m) {
    if (n < m)
        return _longest_common_substr(s2, s1, m, n);
    return _longest_common_substr(s1, s2, n, m);
}
Is this code correct for finding the longest common substring? I don't understand why in many places, for example Wikipedia, a matrix is used to solve the problem, while in this apparently simple solution there is no need for it, the time complexity is still O(n*m), and the space complexity is O(1). A possible test is
int main() {
    const char *str1 = "identification";
    const char *str2 = "administration";
    printf("%s\n", longest_common_substr(str1, str2, strlen(str1), strlen(str2)));
}
and the output is
ation
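For reference, the matrix-based version I am comparing against looks roughly like this. It is only my reading of the DP recurrence from the Wikipedia article, so treat it as an untested sketch (it uses the same headers as above):

char *matrix_longest_common_substr(const char *s1, const char *s2, int n, int m) {
    // L[i][j] = length of the longest common suffix of s1[0..i-1] and s2[0..j-1]
    int *L = calloc((size_t)(n + 1) * (m + 1), sizeof(*L));
    int max = 0;  // length of the best match found so far
    int end = 0;  // index in s2 one past that match
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            if (s1[i - 1] == s2[j - 1]) {
                L[i * (m + 1) + j] = L[(i - 1) * (m + 1) + (j - 1)] + 1;
                if (L[i * (m + 1) + j] > max) {
                    max = L[i * (m + 1) + j];
                    end = j;
                }
            } // else the cell stays 0 from calloc
        }
    }
    free(L);
    char *rv = malloc(max + 1);
    memcpy(rv, s2 + end - max, max);
    rv[max] = '\0';
    return rv;
}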
Substring is intended in a circular way, so with the inputs
action
tionac
the output will be
tionac
Anyway, I could append two different invalid characters at the end of the strings to remove this property.
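Something like this is what I have in mind for the sentinel idea (untested sketch; the helper name and the sentinel bytes are placeholders, they only need to differ from each other and never appear in either input):

// Append a distinct "invalid" byte to each string so that a match can never
// continue across the wrap-around point of the circular scan.
const char *lcs_without_wraparound(const char *s1, const char *s2) {
    size_t n = strlen(s1), m = strlen(s2);
    char *t1 = malloc(n + 2);
    char *t2 = malloc(m + 2);
    memcpy(t1, s1, n); t1[n] = '\x01'; t1[n + 1] = '\0';
    memcpy(t2, s2, m); t2[m] = '\x02'; t2[m + 1] = '\0';
    const char *res = longest_common_substr(t1, t2, (int)n + 1, (int)m + 1);
    free(t1);
    free(t2);
    return res;
}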
Upvotes: 2
Views: 147
Reputation: 144695
Your algorithm performs circular matching, which is different from the classic LCS algorithm. Yet it can be simplified to remove this extra matching and also remove the need to make s1 the longer string:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *longest_common_substr(const char *s1, const char *s2, int n1, int n2) {
    int begin = 0;           // offset of lcs in s2
    int max = 0;             // length of lcs
    for (int i = 0; i < n1; i++) {
        int tmp = 0;         // count of matched letters
        int index = i;       // current s1 index
        for (int j = 0; j < n2; j++, index++) {
            if (index == n1) {
                index = 0;
                tmp = 0;     // do not match circularly
            }
            if (s1[index] == s2[j]) {
                tmp++;
                if (tmp > max) {
                    max = tmp;
                    begin = j - (tmp - 1);
                }
            } else {
                tmp = 0;
            }
        }
    }
    char *rv = malloc(sizeof(*rv) * (max + 1));
    memcpy(rv, s2 + begin, max);
    rv[max] = '\0';
    return rv;
}

int main(int argc, char *argv[]) {
    if (argc == 3) {
        const char *s1 = argv[1];
        const char *s2 = argv[2];
        char *p = longest_common_substr(s1, s2, strlen(s1), strlen(s2));
        if (p) {
            printf("lcs('%s','%s') -> '%s'\n", s1, s2, p);
        }
    }
    return 0;
}
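For example, called with the strings from the question (./lcs identification administration), it should print
lcs('identification','administration') -> 'ation'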
The above does have a time complexity of O(n1*n2) and a space complexity of O(1). Achieving better time complexity is usually a desired goal, which may explain why the Wikipedia article has a suboptimal example for the more complex case.
Upvotes: 1
Reputation: 75688
Your algorithm is a weird one: you count circular substrings on the longer string only, and if the strings are of equal length you count circular substrings on the first one only.
For instance, for "bxa" and "ab" you find the solution "ab", because you count circular substrings on "bxa". But for "bxa" and "zaby" you don't consider circular substrings in "bxa", and you don't find "ab" as a solution.
Likewise, ("abx", "bya") and ("bya", "abx") have different solutions even though only the order of the arguments changes (you count circular substrings only on the first argument if both arguments are of equal length).
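A small driver to reproduce these cases with the code from the question (assuming it is compiled in place of the question's own main; the expected results in the comments are worked out by hand, so double-check them):

#include <stdio.h>
#include <string.h>

const char *longest_common_substr(const char *s1, const char *s2, int n, int m);

int main(void) {
    const char *pairs[][2] = {
        { "bxa", "ab"   },  // circular match on "bxa": finds "ab"
        { "bxa", "zaby" },  // "zaby" is longer, so "ab" is missed: finds "b"
        { "abx", "bya"  },  // equal lengths, circular on "abx" only: finds "b"
        { "bya", "abx"  },  // same strings swapped, circular on "bya": finds "ab"
    };
    for (size_t i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++) {
        const char *a = pairs[i][0], *b = pairs[i][1];
        printf("lcs(\"%s\", \"%s\") = \"%s\"\n", a, b,
               longest_common_substr(a, b, (int)strlen(a), (int)strlen(b)));
    }
    return 0;
}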
You must either:
not count circular substrings at all (the classic algorithm, as described in the wiki link), or
count circular substrings for both strings (this is a different algorithm than longest common substring).
So the complexity of your algorithm is not comparable with that of the wiki algorithm, because it is a different algorithm solving a different problem.
Upvotes: 2