String combinations that include a substring over a finite alphabet in Python

Question

Let's assume we have an alphabet of 20 letters and the substring CCAY. I would like to count the words of length N that contain this substring.

To be more precise, if N = 6, I would like to count the combinations CCAYxx, xCCAYx and xxCCAY, where x is any letter of the alphabet. If N = 7, the combinations become CCAYxxx, xCCAYxx, xxCCAYx and xxxCCAY, and so on.

I can also think of a pitfall: when the substring consists of a single repeated letter, e.g. CCCC, then for N = 6 the string CCCCCC should not be counted multiple times.

I would appreciate any help or guidance on how to approach this problem. Sample code in Python would also be highly appreciated.

Arne · Accepted Answer

You said brute force is okay, so here we go:

import itertools

alphabet = 'abc'
substring = 'ccc'
n = 7

res = set()
# Take the Cartesian product of the alphabet for the n - len(substring) free
# letters, so that every final combination has a total length of n.
for combination in itertools.product(alphabet, repeat=n-len(substring)):
    # Insert the substring at every possible position; the set drops duplicates.
    for idx in range(len(combination)+1):
        res.add(''.join((*combination[:idx], substring, *combination[idx:])))
print(len(res))

Prints:

295

For a substring with no repetitions, like abc, I get 396 as the result, so I assume it handles the corner case appropriately.

That this is inefficient enough to make mathematicians weep goes without saying, but as long as your problems are small in length it should get the job done.
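As a cross-check of the insertion approach, here is a minimal sketch that instead enumerates all k^n words and keeps the ones containing the substring. The name count_by_enumeration is made up for illustration, and it is just as exponential as the code above, so it is only meant for small n.

```python
import itertools


def count_by_enumeration(alphabet, substring, n):
    """Count the length-n words over `alphabet` that contain `substring`.

    Hypothetical helper for cross-checking only: it walks through all
    len(alphabet)**n words, so it is limited to small inputs.
    """
    return sum(
        substring in ''.join(word)
        for word in itertools.product(alphabet, repeat=n)
    )


print(count_by_enumeration('abc', 'abc', 7))  # 396, matching the result above
```

Because every word is generated exactly once, no set is needed here to avoid double counting.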

Analytical approach

Let k = len(alphabet) and m = len(substring). The substring leaves n - m free letters, which can be filled in k^(n-m) ways, and it can be inserted at any of n - m + 1 positions, which gives an upper bound of (n-m+1)*k^(n-m) final combinations. That bound is only reached if no two insertions produce the same final combination, and accounting for those duplicates is what makes this problem hard to compute analytically. So, the vague answer is that your result will be somewhere between k^(n-m) and (n-m+1)*k^(n-m). For the example above (k = 3, n = 7, m = 3) that means between 81 and 405, which is consistent with the 396 for abc.
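To make the bounds concrete, here is a small sketch. It assumes the bounds are meant for the n - m free letters that remain once a substring of length m is placed, so there are k^(n-m) fills and n - m + 1 insertion positions. The name insertion_bounds is made up for illustration.

```python
def insertion_bounds(k, n, m):
    """Return (lower, upper) bounds on the number of length-n words over a
    k-letter alphabet that contain a given substring of length m.

    Assumption: lower = words that start with the substring, upper = every
    (fill, insertion position) pair counted as if it were distinct.
    """
    free = n - m
    if free < 0:
        return 0, 0  # the substring does not fit at all
    fills = k ** free
    return fills, (free + 1) * fills


print(insertion_bounds(3, 7, 3))  # (81, 405)
```

The upper bound is exact only when no word can contain the substring twice, for example CCAY with N = 6, where insertion_bounds(20, 6, 4) gives an upper bound of 1200.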

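If you want to count the identical final combinations that include the substring, you can do so, in cases like the one below, by counting how often the substring already occurs in a preliminary product (the free letters before the substring is inserted). This shortcut does not carry over to substrings that overlap with themselves: inserting cc into cc yields only cccc, not 3 - 1 = 2 distinct results. For example: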

n = 6
pre_prod = 'abab'
sub = 'ab'
pre_prods = ['ababab', 'aabbab', 'ababab', 'abaabb', 'ababab']
prods = ['ababab', 'aabbab', 'abaabb']
# len(pre_prods) - pre_prod.count(sub) -> len(prods), i.e. 5 - 2 = 3

I will see if I can find a formula for that sometime soon.
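For an exact count without enumerating words, one standard alternative (not the formula hinted at above) is to count the words that avoid the substring with a dynamic program over the states of a KMP-style matching automaton, and then subtract that from k^n. The sketch below assumes a non-empty substring and an alphabet of distinct letters; count_containing and the 20-letter alphabet in the usage lines are made up for illustration.

```python
def count_containing(alphabet, substring, n):
    """Count the length-n words over `alphabet` that contain `substring`."""
    m = len(substring)

    # KMP failure function: fail[i] is the length of the longest proper
    # prefix of substring[:i + 1] that is also a suffix of it.
    fail = [0] * m
    j = 0
    for i in range(1, m):
        while j and substring[i] != substring[j]:
            j = fail[j - 1]
        if substring[i] == substring[j]:
            j += 1
        fail[i] = j

    def step(state, ch):
        """Matched prefix length after reading `ch` in `state` (0..m-1)."""
        while state and substring[state] != ch:
            state = fail[state - 1]
        return state + 1 if substring[state] == ch else state

    # counts[s]: number of words read so far that have not yet contained
    # the substring and currently match its first s letters.
    counts = [0] * m
    counts[0] = 1
    for _ in range(n):
        nxt = [0] * m
        for state, c in enumerate(counts):
            if c:
                for ch in alphabet:
                    t = step(state, ch)
                    if t < m:  # t == m would complete the substring
                        nxt[t] += c
        counts = nxt

    return len(alphabet) ** n - sum(counts)


letters = 'ACDEFGHIKLMNPQRSTVWY'  # hypothetical 20-letter alphabet
print(count_containing(letters, 'CCAY', 6))  # 1200
print(count_containing(letters, 'CCCC', 6))  # 1160
```

The work grows roughly with n * len(substring) * len(alphabet) instead of len(alphabet)**n, and a word such as CCCCCC is counted once because each word follows exactly one path through the automaton.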
