Kavish Dwivedi

Reputation: 745

Counting distinct common subsequences for a given set of strings

I was reading this paper on counting the number of distinct common subsequences of two strings, which describes a DP approach to the problem. When there are more than two strings, a different approach may be needed. Can the number of distinct common subsequences of more than two strings be computed in less than exponential time, and if so, how?

Upvotes: 0

Views: 1627

Answers (1)

btilly

Reputation: 46389

If you have an alphabet of size k and m strings of length at most n, then (assuming that all individual math operations are O(1)) this problem is solvable with dynamic programming in time at most O(k n^(m+1)) and memory O(k n^m). Those are not tight bounds, and in practice time and memory use should be significantly better than that. With long strings, however, the counts grow large enough to need big integer arithmetic, which makes math operations more expensive than O(1). Still, for a fixed number of strings m it is polynomial in n.

Here is the trick, in an unfortunately dense form. We want to build up a series of tables, one per subsequence length. Each table lists, for each way of picking one position in each string, the number of distinct subsequences of that length whose minimal expression (its leftmost embedding) in each string ends at the chosen position. Every common subsequence is counted in exactly one entry, so the sum of all of those values is our final answer.
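To make the "minimal expression" idea concrete, here is a small Python sketch. It matches a subsequence greedily against a string, always taking the earliest possible occurrence of each character, and reports the 1-based position where that match ends. The function name `earliest_end` and the sample strings are made up for illustration only.

```python
def earliest_end(s, sub):
    """Return the 1-based position in s where the leftmost (greedy) match
    of sub ends: 0 for the empty subsequence, -1 if sub does not occur
    in s as a subsequence."""
    pos = 0  # number of characters of s consumed so far
    for ch in sub:
        idx = s.find(ch, pos)  # earliest occurrence at 0-based index >= pos
        if idx == -1:
            return -1
        pos = idx + 1
    return pos


# Hypothetical sample input: each subsequence gets one tuple of end positions.
strings = ["abcab", "bacab"]
for sub in ["", "ab", "cab", "bb"]:
    print(repr(sub), tuple(earliest_end(s, sub) for s in strings))
```

Each subsequence maps to exactly one tuple of end positions, but several subsequences can share a tuple ("cab" and "bb" both end at the last character of both sample strings). That is why the tables store a count per tuple rather than a flag.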

Here is an outline of how to do it (which you can do without understanding the above description).

  1. For each string, build a transition table mapping (position in string, character) to the position of the next occurrence of that character. The tables should start with position 0 being before the first character. You can use -1 for running off of the end of the string.

  2. Create a map whose keys are lists of integers, one position per string, and whose values are integers. Each value will be the count of subsequences of a fixed length whose shortest representation in each string ends at that set of positions.

  3. Insert as the sole value (0, 0, ..., 0) -> 1 to represent the fact that there is 1 subsequence of length 0 and its shortest representation in each string ends at the start.

  4. Set the total count of common subsequences to 0.

  5. While that map is not empty:

    1. Add the sum of values in that map to the total count of common subsequences.

    2. Create a second map of the same type, with no data.

    3. For each key/value pair in the first map:

      1. For each possible character in your alphabet:

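        1. Construct a new vector of integers to be a new key: for each string, look at its current position in the key and take the position of the next occurrence of that character from the transition table. If the character does not occur again in any one of the strings (the lookup gives -1), discard this key and move on to the next character.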

        2. If that key is not in your second map, insert it with value 0.

        3. Increase the value for that key in the second map by your current value in the current map. (Basically add the number of subsequences that just had this minimal character transition.)

    4. Copy the second data structure to the first.

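  6. The total now holds the count of distinct subsequences common to all of the strings. Note that it includes the empty subsequence, because of the entry inserted in step 3; subtract 1 if you do not want to count it.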
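The outline above can be sketched in Python roughly as follows. This is a minimal illustration rather than a tuned implementation: the names `build_next_table` and `count_common_subsequences` are my own, and as a small shortcut not in the outline, the alphabet is restricted to characters that occur in every string, since no other character can extend a common subsequence.

```python
from collections import defaultdict


def build_next_table(s, alphabet):
    """Step 1: nxt[pos][c] is the 1-based position of the next c after pos,
    or -1 if there is none. Position 0 means 'before the first character'."""
    n = len(s)
    nxt = [None] * (n + 1)
    nxt[n] = {c: -1 for c in alphabet}
    for pos in range(n - 1, -1, -1):
        row = dict(nxt[pos + 1])
        row[s[pos]] = pos + 1
        nxt[pos] = row
    return nxt


def count_common_subsequences(strings):
    """Count distinct common subsequences, including the empty one."""
    if not strings:
        raise ValueError("need at least one string")
    # Assumption: only characters present in every string are worth trying.
    alphabet = sorted(set.intersection(*(set(s) for s in strings)))
    tables = [build_next_table(s, alphabet) for s in strings]

    current = {tuple(0 for _ in strings): 1}  # steps 2 and 3
    total = 0                                 # step 4
    while current:                            # step 5
        total += sum(current.values())
        following = defaultdict(int)
        for key, count in current.items():
            for c in alphabet:
                new_key = tuple(t[p][c] for t, p in zip(tables, key))
                if -1 in new_key:
                    continue  # c is missing from the rest of some string
                following[new_key] += count
        current = following
    return total                              # step 6


print(count_common_subsequences(["ab", "ba"]))  # 3: "", "a" and "b"
```

Python integers are arbitrary precision, so the big integer concern is handled by the language here, at the cost of non-constant-time additions once the counts get large.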

Upvotes: 1
