Reputation: 11
Is there an efficient way to compute the number of possible subsequences of a bit array?
The array is read from left to right, possibly omitting some elements. Duplicate subsequences are not allowed.
Brute-forcing through all possible subsequences takes a long time when array grows in size.
Upvotes: 0
Views: 1169
Reputation: 241701
This simple linear-time algorithm is taken from "Algorithms for subsequence combinatorics" by Cees Elzinga et al. (2008), modified slightly because mathematics tends to be 1-indexed, but Python is 0-indexed. It will work for any sequence s
, not just binary sequences:
def count_unique_subsequences(s):
"""Returns the number of unique subsequences of the sequence s"""
L = {}
N = []
count = 1
for c in s:
N.append(count)
count *= 2
if c in L:
count -= N[L[c] - 1]
L[c] = len(N)
return count
This is a dynamic programming solution, which iteratively computes the number of unique subsequences of each prefix of the current string. All of these subsequences are still subsequences of the next prefix, and additionally we can add any subsequence extended with the next character except for those subsequences which were not extended the last time we encountered the same character. (Because at that point, we counted all those subsequences extended with the character.) In this algorithm, the vector N
maintains the count of unique subsequences for each successive prefix of s
(indexed by the length of the prefix), while L
keeps track of the index of last occurrence of each character.
After thinking about this code, I realized that N
is really redundant; the only reason we need it is to be able to look up the subsequence count corresponding to the current character. But we could just store that count directly into L
instead of storing the index for a second table lookup. This doesn't change the time complexity of the algorithm (although it speeds it up slightly) but it does reduce the space complexity to O(|Σ|), where Σ is the alphabet. For binary sequences, that makes the algorithm linear-time/constant-space. Here's the modified algorithm:
def count_unique_subsequences(s):
"""Returns the number of unique subsequences of the sequence s"""
L = {}
count = 1
for c in s:
adds = count - L.get(c, 0)
L[c] = count
count += adds
return count
As presented, the function counts the empty subsequence which doesn't appear in your enumeration, so you might want to subtract one from the final result.
Amongst many other interesting results, the Elzinga paper also considers the maximum unique subsequence count for an alphabet of a given size, demonstrating that the maximum count is a generalized Fibonacci sequence. For alphabet size 2, the maximum count can be computed as:
max_count(0) = 1
max_count(1) = 2
max_count(n) = max_count(n - 2) + max_count(n - 1) + 1
which is fibonacci(n+2)-1
.
The string which generates the maximum pattern consists of a cyclic repetition of the alphabet.
Actually enumerating all the unique subsequences must therefore take exponential time, since there are (potentially) an exponential number of such sequences. However, the exponent (for binary sequences) is φ, which is less than 2.
Upvotes: 3