Number of Longest Common Subsequences

I need to find the number of distinct longest common subsequences between two strings A and B. I am currently using the normal dynamic programming approach, and then generating all the distinct substrings by using a backtrack array and then doing a depth first search from the starting index.

However, since the number of possible such answers is very high, my code is too slow. Is there any way to count the number of such distinct longest common subsequences without actually generating them ?

My code so far:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Stack;

class Node
{
String res = "";
int i;
int j;

public Node( int _i, int _j, String s )
{
    i = _i;
    j = _j;
    res = s;
}
}

public class LCSRevisited
{
static String a;
static String b;
static int m,n;
static int[][] memo;
static int[][] bt; // 1 means [i+1][j], 2 means [i][j+1], 3 means [i+1][j+1]
// 4  - means both

static HashSet <String> filter;

static void printAllStrings( )
{
    Iterator i = filter.iterator();

    while( i.hasNext())
    {
        System.out.println( i.next() );
    }
} 

 static void printSol()
 {
   System.out.print( memo[ 0 ][ 0 ]);

   // check how many UNIQUE such strings exist

   filter = new HashSet();
   Stack<Node> s = new Stack();
   Node start = new Node( 0, 0, "" );
   s.push( start );
   Node curr;
   String res;

   // use backtrack array to do a DFS

   while( !s.isEmpty() )
   {
        curr = s.pop();
        res = curr.res;

        if( ( curr.i>=m) || ( curr.j >=n ) )
        {
            filter.add( curr.res);
            continue;
       }

        // check backtrack value
        int i = curr.i;
        int j = curr.j;
        int back = bt[ i ][ j];

        if( back == 1 )
        {
            s.push( new Node( i+1, j, res ));
        }
        if( back == 2 )
        {
            s.push( new Node( i, j+1, res ));
        }
        if( back == 3 )
        {
            s.push( new Node( i+1, j+1, res+a.charAt(i) ));
        }
        if( back == 4 )
        {
            s.push( new Node( i, j+1, res ));
            s.push( new Node( i+1, j, res ));
        }
   }
   //printAllStrings();
   System.out.println(" " + filter.size() );
}

static void solve()
{
   // fill base cases
   m = a.length();
   n = b.length();
   memo = new int[ m+1 ][ n+1 ];
   Arrays.fill( memo[m], 0 );

   bt = new int[ m+1 ][ n+1 ];

   for( int i=0; i<m; i++ )
   {
       memo[ i ][ n ] = 0;    
   }

   // Now compute memo values
   for( int i=m-1; i>=0; i-- )
   {
       for( int j=n-1; j>=0; j-- )
       {
           if( a.charAt(i) == b.charAt(j))
           {
               memo[ i ][ j ] = 1 + memo[ i+1 ][ j+1 ];
               bt[ i ][ j ] = 3;
           }
           else
           {
               int r1 = memo[ i+1 ][ j ];
               int r2 = memo[ i ][ j+1 ];

               if( r1==r2 )
               {
                    memo[ i ][ j ] = r1;
                    bt[ i ][ j ] = 4;
               }
               else if( r1 > r2 )
               {
                   memo[ i ][ j ] = r1;
                   bt[ i ][ j ] = 1;
               }
               else
               {
                   memo[ i ][ j ] = r2;
                   bt[ i ][ j ] = 2;
               }
           }
       }
   }

   printSol();
 }

public static void main( String[] args ) throws IOException
{
 BufferedReader br = new BufferedReader( new InputStreamReader( System.in ));

int T= Integer.parseInt( br.readLine() );

while( T--> 0 )
{
    a = br.readLine();
    b = br.readLine();

    if( T>=1 )
    br.readLine();

    solve();
    // printArr( bt );
}
}
}

Upvotes: 3

Answers (2)

Chris

Reputation: 26878

I'm thinking you can use a rolling hash function like Rabin Karp's. That way, you can calculate new hash values of longer common subsequences without having to regenerate the whole string again and hashing it.

Actually, I think you can use pure DP to find your answer. Suppose you already calculated the values of the DP table for LCS (memo[][] in your code, I think). Then you can calculate the number of distinct LCS instances like this

for j ← 0 to n do
    for i ← 0 to m do
        if i = 0 or j = 0 then
            D[i, j] ← 1
        else
            D[i, j] ← 0
            if ai  = bj  then
                D[i, j] ← D[i − 1, j − 1]
            else if L[i − 1, j] = L[i, j] then
                D[i, j] ← D[i, j] + D[i − 1, j]
            endif
            if L[i, j − 1] = L[i, j] then
                D[i, j] ← D[i, j] + D[i, j − 1]
            endif
            if L[i − 1, j − 1] = L[i, j] then
                D[i, j] ← D[i, j] − D[i − 1, j − 1]
            endif
        end if
    endfor
endfor

Your answer is D[n, m]. Hope either my ideas helped!

Upvotes: 1

seviyor

Reputation: 96

Maybe using a kind of trie can help you generate the actual sequences while you compute the length, and after compute the total number with a traversal in the trie (linear time).

Now, memo[i][j] represents the length of the common subsequence of A[i...m] and B[j..n]. I propose that you also have a lp[i][j] representing a list of pointers each to a node in the trie such that the path from that node to the root of the trie gives you one of the longest common subseq for A[i...m] and B[j..n]. To build this lp, for case 1 and 2 you simply copy the lists from lp[i+1][j] or lp[i][j+1], both for case 4, while for 3 you add a new node in the tree with the value A[i]=B[j] and in the tree set all the nodes pointed by lp[i+1][j+1] as sons of the new node. These operations would be linear (or maybe even faster with some advanced data structures for dealing with sets). Note that what I've described isn't actually a trie/tree (a node can have multiple parents).

Finaly, for counting, I think that a traversal would be good, with some additional processing - propagating the counts: count[node][level] = sum(count[sons(node)][level-1]) or, if node is a leaf count[node][1]=1, count[node][l!=1]=0. And your answear would be summing count for those "root" nodes (in-degree 0) that satisfy the "longest" constraint (\sum_x{count[x][l_max]}).

I'm not 100% sure of my solution, but it may be a good start for an improved answer to your problem.

Upvotes: 0

Number of Longest Common Subsequences

Answers (2)

Related Questions