Calculate the number of fixed-length strings that contain a given word exactly once

Question

For example given the fixed-length N=10 and the word W="radar". Also we know we have 26 characters.

Then we can "easily" calculate the number of strings as:

radar _ _ _ _ _ -> 26^5-26-1 (because radarradar and radaradar_ is not ok)

_ radar _ _ _ _ -> 26^5-26

_ _ radar _ _ _ -> 26^5

_ _ _ radar _ _ -> 26^5

_ _ _ _ radar _ -> 26^5-26

_ _ _ _ _ radar -> 26^5-26-1

Which is 71288150. Is there a better solution to calculate this? Even, if N can be very large.

kcsquared · Accepted Answer

This answer is largely based on the software engineering stack thread asking a related question, which itself is based on the KMP algorithm.

The idea is to build a deterministic finite automaton (DFA) whose transitions are the lowercase letters that can encode the property of matching a pattern zero, one or 2+ times. For example, with pattern = 'radar', we have a pattern length n of 5, and so we'll have n + n + 1 states in total in our DFA. It's helpful to think of these as 3 rows of states:

Row 0, column i is the state corresponding to 0 full matches, and the last i letters seen form our current partial match.
Row 1, column i is the state corresponding to 1 full match, and the last i letters seen form our current partial match.
Row 2 has a single state which can't be escaped: We have seen at least 2 full matches already.

Our 'row' can only ever increase. In the code, there's really only one flat list of 2n+1 states, but that's just an implementation detail.

A partial drawing of the DFA graph for 'radar' is shown: almost all edges are not drawn, since every state has 26 outward transitions. Usually at most two transitions do not lead back to the 'mismatch' states at the start of a row (here, 0 and 5), but I also couldn't include some back edges where networkx would draw them overlapping.

To get the answer, we need to count the frequencies of being in each possible state (starting from 0) after seeing length = 10 letters-- the good states are n, n+1 ... 2n-1, which correspond to exactly one full match. To generate the transition matrix, we can use the prefix function from the KMP algorithm, with almost no modifications. The definition of the DFA for the first row is as follows:

For state 0 <= i < n,
DFA[i][letter] = Length of longest suffix of (pattern[:i]+letter)
                 that is also a prefix of pattern

with the caveat that a full match in state n-1 moves us to row 1. The definition for states n <= i < 2n is the same, just with every state shifted up by n.

Using numpy, we can get an answer using matrix powers fairly quickly: we need to raise a roughly 2nx2n matrix to the power of our fixed length. If length is too large, you'll either need to use Python's big integers or take moduli to avoid overflow, once the answer gets too large. However, matrix exponentiation can be done with repeated squaring for a time complexity of O(n^3 * log(length)) using naive matrix multiplication.

def kmp_count(pattern: str, length: int) -> int:
    if not pattern.islower():
        raise ValueError("Pattern must be lowercase")

    n = len(pattern)

    if n > length:
        return 0
    if n == length:
        return 1
    if n == 1:
        return length * pow(25, length - 1)

    pattern_chars = [ord(x) - ord('a') for x in pattern]

    # Create the DFA
    dfa = [[0 for _ in range(26)] for _ in range(2 * n + 1)]

    # Final failure state is an absorbing state
    dfa[2 * n] = [2 * n for _ in range(26)]

    dfa[0][pattern_chars[0]] = 1
    restart_state = 0
    for i in range(1, n):
        dfa[i] = dfa[restart_state][:]

        dfa[i][pattern_chars[i]] = i + 1
        restart_state = dfa[restart_state][pattern_chars[i]]

    dfa[n - 1][pattern_chars[n - 1]] += restart_state

    for i in range(n, 2 * n):
        dfa[i] = [x + n for x in dfa[i - n]]

    # Two or more matches moves us to fail state
    dfa[2 * n - 1][pattern_chars[n - 1]] = 2 * n

    transitions = np.zeros((2 * n + 1, 2 * n + 1), dtype=np.int64)
    for i, x in enumerate(dfa):
        for y in x:
            transitions[i][y] += 1

    final_transitions = np.linalg.matrix_power(transitions, length)
    return final_transitions[0, n:2 * n].sum()

print(kmp_count(pattern='radar', length=10))
>>> 71288150

Calculate the number of fixed-length strings that contain a given word exactly once

Answers (1)

Related Questions