jiwon
jiwon

Reputation: 41

How can I find LCS(longest common subsequence) with gap constraint?

I know the general LCS algorithm and can make the table.

But how to find the LCS with a gap constraint as shown in the picture above?

How can I make a table?

An example of LCS with gap constraint:

enter image description here

Upvotes: 2

Views: 968

Answers (2)

relatively_random
relatively_random

Reputation: 5126

Longest common subsequence with a gap constraint can be thought of as longest common substring with a gap allowance. So we can modify the standard dynamic programming algorithm from Wikipedia.

I wrote the pseudocode to return just a single best substring/subsequence, but it should be adaptable to track all possible best solutions if required.

Should be equivalent to mcskinner's answer in terms of runtime complexity etc, but may be easier to implement in a lower level language because it doesn't require any data structures or recursion.

function LongestCommonSubstringWithGaps(S[1..r], T[1..n], gs, gt)
        
// Stores the length of the best solution ending with a match at i,j.
// If there is no match at i,j, stores zero.
L := array(1..r, 1..n)
// Stores the indices of the sub-solution used to construct each L.
Lkl := array(1..r, 1..n) of pairs

// Best solution length.
z := 0
// The indices of the last match of the best solution.
zij := (0, 0)

for i := 1..r
    kmin := max(1, i - gs - 1)
    for j := 1..n

        // Nothing to do if there is no match.
        if S[i] != T[j]
            // Setting to zero unnecessary if L is already initialized.
            L[i, j] := 0
            continue

        L[i, j] := 1
        // Setting to zero unnecessary if Lkl is already initialized.
        Lkl[i, j] := (0, 0)
        
        // Searching backwards for possible sub-solutions to build upon.
        lmin := max(1, j - gt - 1)
        for k := i-1..kmin
            for l := j-1..lmin
                if L[k, l] + 1 > L[i, j]
                    L[i, j] := L[k, l] + 1
                    Lkl[i, j] := (k, l)
                // In each row, every non-zero sub-solution can only be
                // better than anything else farther down but still
                // within the gap window.
                // So stopping each row on first non-zero.
                if L[k, l] > 0
                    break

        if L[i, j] > z
            z := L[i, j]
            zij := (i, j)

// Rebuilding the solution by following individual steps backwards.
ret := ''
while zij != (0, 0)
    ret := S[zij(1)] + ret
    zij := Lkl[zij(1), zij(2)]

return ret

Upvotes: 1

mcskinner
mcskinner

Reputation: 2748

It is possible to design a recursive algorithm to handle this constraint.

By memoizing (i.e. caching) the answers, you get the same big-O efficiency as you would with Dynamic Programming. It will have a larger constant factor due to the use of function calls instead of for loops. But the two are equivalent, and you can think of DP as an "inside out" recursion with memoization.

The basic idea is to create a gap "budget" for each of the sequences, and then account for this when skipping letters. There are 4 cases, which closely track the standard recursive approach to LCS.

  1. Terminate the match. This may not be the best option, but it is always available in case none of the other options have their conditions met.
  2. The current pair of characters match. The working LCS can be extended by 1, and the gap budgets can be reset to the maximum value k.
  3. There is still some budget to skip the left character, so try that. The LCS does not grow, and the left gap budget is reduced.
  4. Same as 2, but now try skipping the right character if there is budget.

Here's some Python code that does what you want. It could be made much more efficient, especially by using a better language than Python, but it should give you an idea.

def gap_lcs(lh, rh, k): 
    ln, rn = len(lh), len(rh) 

    memo = {}
    choices = {}
    def rec(li, ri, l_budget, r_budget):
        # At the end of either sequence we are forced to use
        # Case 1: terminate the match.
        if li >= ln or ri >= rn:
            return 0 

        # Cache results. This limits the complexity to O(ln * lm * k^2).
        # Without this the recursion would take exponential time.
        key = (li, ri, l_budget, r_budget) 
        if key in memo: 
            return memo[key]

        # Case 1: terminate the match.
        res = 0
        choice = (0, 0)

        # Case 2: matching characters, extend the sequence.
        if lh[li] == rh[ri]:
            test = 1 + rec(li + 1, ri + 1, k, k)
            if test > res:
                res = test
                choice = (1, 1)

        # Case 3: skip the left character if there's still budget.
        if l_budget > 0:
            test = rec(li + 1, ri, l_budget - 1, r_budget)
            if test > res:
                res = test
                choice = (1, 0)

        # Case 4: skip the right character if there's still budget.
        if r_budget > 0:
            test = rec(li, ri + 1, l_budget, r_budget - 1)
            if test > res:
                res = test
                choice = (0, 1)

        memo[key] = res
        choices[key] = choice
        return res

    # Find the best combination of starting points within the two sequences.
    # This is so the gap constraint will not apply to skips at the start.
    res = 0
    best_li, best_ri = 0, 0
    for li in range(ln):
        for ri in range(rn):
            test = rec(li, ri, k, k)
            if test > res:
                res, best_li, best_ri = test, li, ri

    # Reconstruct the LCS by following the choices we tracked,
    # starting from the best start we found.
    li, ri = best_li, best_ri
    l_budget, r_budget = k, k

    path = []
    while True:
        key = (li, ri, l_budget, r_budget)

        # Case 1.
        if key not in choices:
            break

        inc_li, inc_ri = choices[key]

        # Case 1.
        if inc_li == 0 and inc_ri == 0:
            break

        if inc_li == 1 and inc_ri == 1:
            # Case 2.
            l_budget, r_budget = k, k
            path.append((lh[li], li, ri))
        else:
            # Cases 3 and 4.
            l_budget -= inc_li
            r_budget -= inc_ri

        li += inc_li
        ri += inc_ri

    return path

This produces the desired output for your examples.

gap_lcs('RCLPCRR', 'RPPLCPLRC', 2)
# [('R', 0, 0), ('P', 3, 1), ('C', 4, 4), ('R', 6, 7)]

gap_lcs('RCLPCRR', 'RPPLCPLRC', 1)
# [('C', 1, 4), ('P', 3, 5), ('R', 5, 7)]

gap_lcs('RCLPCRR', 'RPPLCPLRC', 0)
# [('R', 0, 7), ('C', 1, 8)]

Upvotes: 2

Related Questions