Reputation: 75466

Efficient Algorithm for String Concatenation with Overlap

We need to combine 3 columns in a database by concatenation. However, the 3 columns may contain overlapping parts and the parts should not be duplicated. For example,

  "a" + "b" + "c" => "abc"
  "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
  "abcdede" + "dedefgh" + "" => "abcdedefgh"
  "abcde" + "d" + "ghlmn" => "abcdedghlmn"
  "abcdef" + "" + "defghl" => "abcdefghl"

Our current algorithm is pretty slow because it uses brute-force to identify the overlapping part between 2 strings. Does any one know an efficient algorithm to do this?

Say we have 2 strings A and B. The algorithm needs to find the longest common substring S so that A ends with S and B starts with S.

Our current brute-force implementation in Java is attached for reference,

public static String concat(String s1, String s2) {
    if (s1 == null)
        return s2;
    if (s2 == null)
        return s1;
    int len = Math.min(s1.length(), s2.length());

    // Find the index for the end of overlapping part
    int index = -1;
    for (int i = len; i > 0; i--) {
        String substring = s2.substring(0, i);
        if (s1.endsWith(substring)) {
            index = i;
            break;
        }
    }
    StringBuilder sb = new StringBuilder(s1);
    if (index < 0) 
        sb.append(s2);
    else if (index <= s2.length())
        sb.append(s2.substring(index));
    return sb.toString();
}

Upvotes: 18

Answers (12)

Alexander Torstling

Reputation: 18898

Here is a Java implementation which finds the maximum overlap between two strings with length N and M in something like O(min(N,M)) operations ~ O(N).

I had the same idea as @sepp2k:s now deleted answer, and worked a bit further on it. Seems to work fine. The idea is to iterate through the first string and start tracking once you find something which matches the start of the second string. Figured out that you might need to do multiple simultaneous trackings if false and true matches are overlapping. At the end you choose the longest track.

I havent worked out the absolutely worst case yet, with maximal overlap between matches, but I don't expect it to spiral out of control since I think that you cannot overlap arbitrary many matches. Normally you only track one or two matches at a time: the candidates are removed as soon as there is a mismatch.

static class Candidate {
    int matchLen = 0;
}

private String overlapOnce(@NotNull final String a, @NotNull final String b) {
    final int maxOverlap = Math.min(a.length(), b.length());
    final Collection<Candidate> candidates = new LinkedList<>();
    for (int i = a.length() - maxOverlap; i < a.length(); ++i) {
        if (a.charAt(i) == b.charAt(0)) {
            candidates.add(new Candidate());
        }
        for (final Iterator<Candidate> it = candidates.iterator(); it.hasNext(); ) {
            final Candidate candidate = it.next();
            if (a.charAt(i) == b.charAt(candidate.matchLen)) {
                //advance
                ++candidate.matchLen;
            } else {
                //not matching anymore, remove
                it.remove();
            }
        }

    }
    final int matchLen = candidates.isEmpty() ? 0 :
            candidates.stream().map(c -> c.matchLen).max(Comparator.comparingInt(l -> l)).get();
    return a + b.substring(matchLen);
}

private String overlapOnce(@NotNull final String... strings) {
    return Arrays.stream(strings).reduce("", this::overlapOnce);
}

And some tests:

@Test
public void testOverlapOnce() throws Exception {
    assertEquals("", overlapOnce("", ""));
    assertEquals("ab", overlapOnce("a", "b"));
    assertEquals("abc", overlapOnce("ab", "bc"));
    assertEquals("abcdefghqabcdefghi", overlapOnce("abcdefgh", "efghqabcdefghi"));
    assertEquals("aaaaaabaaaaaa", overlapOnce("aaaaaab", "baaaaaa"));
    assertEquals("ccc", overlapOnce("ccc", "ccc"));
    assertEquals("abcabc", overlapOnce("abcabc", "abcabc"));

    /**
     *  "a" + "b" + "c" => "abc"
     "abcde" + "defgh" + "ghlmn" => "abcdefghlmn"
     "abcdede" + "dedefgh" + "" => "abcdedefgh"
     "abcde" + "d" + "ghlmn" => "abcdedghlmn"
     "abcdef" + "" + "defghl" => "abcdefghl"
     */
    assertEquals("abc", overlapOnce("a", "b", "c"));
    assertEquals("abcdefghlmn", overlapOnce("abcde", "defgh", "ghlmn"));
    assertEquals("abcdedefgh", overlapOnce("abcdede", "dedefgh"));
    assertEquals("abcdedghlmn", overlapOnce("abcde", "d", "ghlmn"));
    assertEquals("abcdefghl", overlapOnce("abcdef", "", "defghl"));


    // Consider str1=abXabXabXac and str2=XabXac. Your approach will output abXabXabXacXabXac because by
    // resetting j=0, it goes to far back.
    assertEquals("abXabXabXac", overlapOnce("abXabXabXac", "XabXac"));

    // Try to trick algo with an earlier false match overlapping with the real match
    //  - match first "aba" and miss that the last "a" is the start of the
    // real match
    assertEquals("ababa--", overlapOnce("ababa", "aba--"));
}

Upvotes: 0

Craig Gidney

Reputation: 18296

Most of the other answers have focused on constant-factor optimizations, but it's also possible to do asymptotically better. Look at your algorithm: it's O(N^2). This seems like a problem that can be solved much faster than that!

Consider Knuth Morris Pratt. It keeps track of the maximum amount of substring we have matched so far throughout. That means it knows how much of S1 has been matched at the end of S2, and that's the value we're looking for! Just modify the algorithm to continue instead of returning when it matches the substring early on, and have it return the amount matched instead of 0 at the end.

That gives you an O(n) algorithm. Nice!

    int OverlappedStringLength(string s1, string s2) {
        //Trim s1 so it isn't longer than s2
        if (s1.Length > s2.Length) s1 = s1.Substring(s1.Length - s2.Length);

        int[] T = ComputeBackTrackTable(s2); //O(n)

        int m = 0;
        int i = 0;
        while (m + i < s1.Length) {
            if (s2[i] == s1[m + i]) {
                i += 1;
                //<-- removed the return case here, because |s1| <= |s2|
            } else {
                m += i - T[i];
                if (i > 0) i = T[i];
            }
        }

        return i; //<-- changed the return here to return characters matched
    }

    int[] ComputeBackTrackTable(string s) {
        var T = new int[s.Length];
        int cnd = 0;
        T[0] = -1;
        T[1] = 0;
        int pos = 2;
        while (pos < s.Length) {
            if (s[pos - 1] == s[cnd]) {
                T[pos] = cnd + 1;
                pos += 1;
                cnd += 1;
            } else if (cnd > 0) {
                cnd = T[cnd];
            } else {
                T[pos] = 0;
                pos += 1;
            }
        }

        return T;
    }

OverlappedStringLength("abcdef", "defghl") returns 3

Upvotes: 28

monis rahman

Reputation: 29

Heres a perl -pseudo oneliner:

$_ = s1.s2;

s/([\S]+)\1/\1/;

perl regex's are pretty efficient, you can look up what algo they are using but they definitely implement some type of FSM etc so will get u results in pretty good O(..).

Upvotes: 0

Jonathan

Reputation:

This problem seems like a variation of the longest common sub-sequence problem, which can be solved via dynamic programming.

http://www.algorithmist.com/index.php/Longest_Common_Subsequence

Upvotes: 0

Kirk Broadhurst

Reputation: 28728

I'm trying to make this C# as pleasant to read as possible.

    public static string Concatenate(string s1, string s2)
    {
        if (string.IsNullOrEmpty(s1)) return s2;
        if (string.IsNullOrEmpty(s2)) return s1;
        if (s1.Contains(s2)) return s1;
        if (s2.Contains(s1)) return s2;

        char endChar = s1.ToCharArray().Last();
        char startChar = s2.ToCharArray().First();

        int s1FirstIndexOfStartChar = s1.IndexOf(startChar);
        int overlapLength = s1.Length - s1FirstIndexOfStartChar;

        while (overlapLength >= 0 && s1FirstIndexOfStartChar >=0)
        {
            if (CheckOverlap(s1, s2, overlapLength))
            {
                return s1 + s2.Substring(overlapLength);
            }

            s1FirstIndexOfStartChar = 
                s1.IndexOf(startChar, s1FirstIndexOfStartChar);
            overlapLength = s1.Length - s1FirstIndexOfStartChar;

        }

        return s1 + s2;
    }

    private static bool CheckOverlap(string s1, string s2, int overlapLength)
    {
        if (overlapLength <= 0)
            return false;

        if (s1.Substring(s1.Length - overlapLength) == 
            s2.Substring(0, overlapLength))
            return true;

        return false;            
    }

EDIT: I see that this is almost the same as jerryjvl's solution. The only difference is that this will work with the "abcde", "d" case.

Upvotes: 1

FogleBird

Reputation: 76792

Here's a solution in Python. It should be faster just by not needing to build substrings in memory all the time. The work is done in the _concat function, which concatenates two strings. The concat function is a helper that concatenates any number of strings.

def concat(*args):
    result = ''
    for arg in args:
        result = _concat(result, arg)
    return result

def _concat(a, b):
    la = len(a)
    lb = len(b)
    for i in range(la):
        j = i
        k = 0
        while j < la and k < lb and a[j] == b[k]:
            j += 1
            k += 1
        if j == la:
            n = k
            break
    else:
        n = 0
    return a + b[n:]

if __name__ == '__main__':
    assert concat('a', 'b', 'c') == 'abc'
    assert concat('abcde', 'defgh', 'ghlmn') == 'abcdefghlmn'
    assert concat('abcdede', 'dedefgh', '') == 'abcdedefgh'
    assert concat('abcde', 'd', 'ghlmn') == 'abcdedghlmn'
    assert concat('abcdef', '', 'defghl') == 'abcdefghl'

Upvotes: 3

Phil

Reputation: 174

I think this will be pretty quick:

You have two strings, string1 and string2. Look backwards (right to left) through string1 for the first character of string2. Once you have that position, determine if there is overlap. If there isn't, you need to keep searching. If there is you need to determine if there is any possibility for another match.

To do that, simply explore the shorter of the two strings for a recurrence of the overlapping characters. ie: If the location of the match in string1 leaves a short string1 remaining, repeat the initial search from the new starting point in string1. Conversely, if the unmatched portion of string2 is shorter, search it for a repeat of the overlapping characters.

Repeat as required.

Job done!

This doesn't require much in terms of memory allocation (all searching done in place, just need to allocate the resultant string buffer) and only requires (at most) one pass of one of the strings being overlapped.

Upvotes: 1

jerryjvl

Reputation: 20141

How about (pardon the C#):

public static string OverlapConcat(string s1, string s2)
{
    // Handle nulls... never return a null
    if (string.IsNullOrEmpty(s1))
    {
        if (string.IsNullOrEmpty(s2))
            return string.Empty;
        else
            return s2;
    }
    if (string.IsNullOrEmpty(s2))
        return s1;

    // Checks above guarantee both strings have at least one character
    int len1 = s1.Length - 1;
    char last1 = s1[len1];
    char first2 = s2[0];

    // Find the first potential match, bounded by the length of s1
    int indexOfLast2 = s2.LastIndexOf(last1, Math.Min(len1, s2.Length - 1));
    while (indexOfLast2 != -1)
    {
        if (s1[len1 - indexOfLast2] == first2)
        {
            // After the quick check, do a full check
            int ix = indexOfLast2;
            while ((ix != -1) && (s1[len1 - indexOfLast2 + ix] == s2[ix]))
                ix--;
            if (ix == -1)
                return s1 + s2.Substring(indexOfLast2 + 1);
        }

        // Search for the next possible match
        indexOfLast2 = s2.LastIndexOf(last1, indexOfLast2 - 1);
    }

    // No match found, so concatenate the full strings
    return s1 + s2;
}

This implementation does not make any string copies (partial or otherwise) until it has established what needs copying, which should help performance a lot.

Also, the match check first tests the extremeties of the potentially matched area (2 single characters) which in normal english text should give a good chance of avoiding checking any other characters for mismatches.

Only once it establishes the longest match it can make, or that no match is possible at all, will two strings be concatenated. I have used simple '+' here, because I think the optimisation of the rest of the algorithm has already removed most of the inefficiencies in your original. Give this a try and let me know if it is good enough for your purposes.

Upvotes: 2

bjelli

Reputation: 10090

Or you could do it in mysql with the following stored function:

DELIMITER //

DROP FUNCTION IF EXISTS concat_with_overlap //

CREATE FUNCTION concat_with_overlap(a VARCHAR(100), b VARCHAR(100))
  RETURNS VARCHAR(200) DETERMINISTIC
BEGIN 
  DECLARE i INT;
  DECLARE al INT;
  DECLARE bl INT;
  SET al = LENGTH(a);
  SET bl = LENGTH(a);
  IF al=0 THEN 
    RETURN b;
  END IF;
  IF bl=0 THEN 
    RETURN a;
  END IF;
  IF al < bl THEN
     SET i = al;
  ELSE
     SET i = bl;
  END IF;

  search: WHILE i > 0 DO
     IF RIGHT(a,i) = LEFT(b,i) THEN
    RETURN CONCAT(a, SUBSTR(b,i+1));
     END IF;
     SET i = i - 1;
  END WHILE search;

  RETURN CONCAT(a,b);
END//

I tried it with your test data:

mysql> select a,b,c,
    -> concat_with_overlap( concat_with_overlap( a, b ), c ) as result 
    -> from testing //
+-------------+---------+--------+-------------+
| a           | b       | c      | result      |
+-------------+---------+--------+-------------+
| a           | b       | c      | abc         |
| abcde       | defgh   | ghlmn  | abcdefghlmn |
| abcdede     | dedefgh |        | abcdedefgh  |
| abcde       | d       | ghlmn  | abcdedghlmn |
| abcdef      |         | defghl | abcdefghl   |
| abXabXabXac | XabXac  |        | abXabXabXac |
+-------------+---------+--------+-------------+
6 rows in set (0.00 sec)

Upvotes: 1

Daniel C. Sobral

Reputation: 297225

You may use a DFA. For example, a string XYZ should be read by the regular expression ^((A)?B)?C. That regular expression will match the longest prefix which matches a suffix of the XYZ string. With such a regular expression you can either match and get the match result, or generate a DFA, on which you can use the state to indicate the proper position for the "cut".

In Scala, the first implementation -- using regex directly -- might go like this:

def toRegex(s1: String) = "^" + s1.map(_.toString).reduceLeft((a, b) => "("+a+")?"+b) r
def concatWithoutMatch(s1 : String, s2: String) = {
  val regex = toRegex(s1)
  val prefix = regex findFirstIn s2 getOrElse ""
  s1 + s2.drop(prefix length)
}

For example:

scala> concatWithoutMatch("abXabXabXac", "XabXacd")
res9: java.lang.String = abXabXabXacd

scala> concatWithoutMatch("abc", "def")
res10: java.lang.String = abcdef

scala> concatWithoutMatch(concatWithoutMatch("abcde", "defgh"), "ghlmn")
res11: java.lang.String = abcdefghlmn

Upvotes: 4

bjelli

Reputation: 10090

If you're doing it outside the database, try perl:

sub concat {
  my($x,$y) = @_;

  return $x if $y eq '';
  return $y if $x eq '';

  my($i) = length($x) < length($y) ?  length($x) : length($y);
  while($i > 0) {
      if( substr($x,-$i) eq substr($y,0,$i) )  {
          return $x . substr($y,$i);
      }
      $i--;
  }
  return $x . $y;
}

It's exactly the same algorithms as yours, I'm just curios if java or perl is faster ;-)

Upvotes: 0

James Black

Reputation: 41858

Why not just do something like this. First get the first character or word (which is going to signify the overlap) in the three columns.

Then, start to add the first string to a stringbuffer, one character at a time.

Each time look to see if you reached a part that is overlapped with the second or third string.

If so then start concatenating the string that also contains what is in the first string.

When done start, if no overlap, start with the second string and then the third string.

So in the second example in the question I will keep d and g in two variables.

Then, as I add the first string abc come from the first string, then I see that d is also in the second string so I shift to adding from the second string def are added from string 2, then I move on and finish with string 3.

If you are doing this in a database why not just use a stored procedure to do this?

Upvotes: 0

Efficient Algorithm for String Concatenation with Overlap

Answers (12)

Related Questions