user5927908
user5927908

Reputation:

Creating a list of sequences such that every distinct pair is counted a number of times

Having some issues coming up with an algorithm to create a list of sequences such that every distinct pair is counted at a repeat number of times, where every element in the sequence can only have 1 of 2 possible other elements in it.

To be more clear, if I have the grammar such that the alphabet is [A, B, C, D, E, F]:

A -> B, E

B -> C, D

C -> A, F

D -> F, B

E -> D, A

F -> E, C

means the only possible pairs I can/should have are [(A, B), (A, E), (B, C), (B, D), (C, A), (C, F), (D, F), (D, B), (E, D), (E, A), (F, E), (F, C)].

So for example, a sequence could look something like:

A, B, D, F, E, A, E, D, F, E, A, E, ...

(A, B), (B, D), (E, D) have all occurred once so far and (A, E), (E, A), (F, E), (D, F) are no longer usable (as we've used them twice). We continue this pattern until we have no more usable pairs. For instance, the next element in the above sequence must be D, since we've exhausted the option (E, A).

From that, I could want to get all sequences starting with A up to a length of 24 that follows the above grammar, such that every pair in the sequence is repeated exactly twice.

My initial idea was to just make a bunch of nodes, then perform a generic DFS/BFS on it to get every path while counting the number of pairs that I see while going down the tree, return when there's too many pairs and add to a list of paths when I've reached a depth of 24. Are there better ways to get this done? I feel like I'm overthinking it.

(Note: It could also be a variable depth, and I could start with any element in the alphabet.)

Upvotes: 1

Views: 481

Answers (2)

Mateen Ulhaq
Mateen Ulhaq

Reputation: 27201

Let's start with a simple method that disregards the max_count=2 restriction:

from pprint import pprint
from collections import defaultdict

def list_cycles(grammar, parent, length):
    if length == 1:
        return [parent]

    return [parent + x
        for node in grammar[parent]
        for x in list_cycles(grammar, node, length - 1)]

grammar = {
    'A': ['B', 'E'],
    'B': ['C', 'D'],
    'C': ['A', 'F'],
    'D': ['F', 'B'],
    'E': ['D', 'A'],
    'F': ['E', 'C'],
    }

cycles = list_cycles(grammar, 'A', 6)
print(f'{len(cycles)}\n')
pprint(cycles)

Which outputs:

32

['ABCABC',
 'ABCABD',
 'ABCAED',
 'ABCAEA',
 'ABCFED',
 'ABCFEA',
 'ABCFCA',
 'ABCFCF',
 'ABDFED',
 'ABDFEA',
 'ABDFCA',
 'ABDFCF',
 'ABDBCA',
 'ABDBCF',
 'ABDBDF',
 'ABDBDB',
 'AEDFED',
 'AEDFEA',
 'AEDFCA',
 'AEDFCF',
 'AEDBCA',
 'AEDBCF',
 'AEDBDF',
 'AEDBDB',
 'AEABCA',
 'AEABCF',
 'AEABDF',
 'AEABDB',
 'AEAEDF',
 'AEAEDB',
 'AEAEAB',
 'AEAEAE']

Our requirements require that we get rid of entries like 'AEAEAE'. So let's write a simple DFS-like implementation:

from pprint import pprint
from collections import defaultdict

def adjusted_counts(counts, item):
    counts = counts.copy()
    counts[item] += 1
    return counts

def list_cycles(grammar, max_count, parent, length, counts):
    if length == 1:
        return [parent]

    return [parent + x
        for node in grammar[parent]
        for x in list_cycles(grammar, max_count, node, length - 1,
            adjusted_counts(counts, parent + node))
        if counts[parent + node] < max_count]

grammar = {
    'A': ['B', 'E'],
    'B': ['C', 'D'],
    'C': ['A', 'F'],
    'D': ['F', 'B'],
    'E': ['D', 'A'],
    'F': ['E', 'C'],
    }

cycles = list_cycles(grammar, 2, 'A', 6, defaultdict(int))
print(f'{len(cycles)}\n')
pprint(cycles)

Which outputs:

31

['ABCABC',
 'ABCABD',
 'ABCAED',
 'ABCAEA',
 'ABCFED',
 'ABCFEA',
 'ABCFCA',
 'ABCFCF',
 'ABDFED',
 'ABDFEA',
 'ABDFCA',
 'ABDFCF',
 'ABDBCA',
 'ABDBCF',
 'ABDBDF',
 'ABDBDB',
 'AEDFED',
 'AEDFEA',
 'AEDFCA',
 'AEDFCF',
 'AEDBCA',
 'AEDBCF',
 'AEDBDF',
 'AEDBDB',
 'AEABCA',
 'AEABCF',
 'AEABDF',
 'AEABDB',
 'AEAEDF',
 'AEAEDB',
 'AEAEAB']

Finally, running:

cycles = list_cycles(grammar, 2, 'A', 24, defaultdict(int))
print(len(cycles))

Gives:

18954

Concluding thoughts:

  • Performance might be improved via generators like @PM 2Ring's version
  • Performance might be further improved via a more clever algorithm that only works for this specific grammar, max_count=2, and length=24. One that I was thinking about was: enumerate all permutations which use every pair only once. Then, attempt to insert connecting cycles once in between the various positions.

Upvotes: 0

PM 2Ring
PM 2Ring

Reputation: 55479

Here's a recursive generator that makes your sequences. It was even easier than I thought: no backtracking required. :) To save memory, we use a single dict counts to keep track of the counts of each pair. We increment the count before the recursive call, and reset it after the recursive call.

The sequence length needs to be 25 to get each of the 12 pairs exactly twice in a sequence. The total number of sequences starting with a given letter is 18954.

grammar = {
    'A': 'BE',
    'B': 'CD',
    'C': 'AF',
    'D': 'BF',
    'E': 'AD',
    'F': 'CE',
}

pairs = [k + c for k, v in grammar.items() for c in v]
counts = dict.fromkeys(pairs, 0)

maxcount = 2
total_len = 25

def sequences(seq):
    if len(seq) == total_len:
        yield seq
        return

    k = seq[-1]
    for c in grammar[k]:
        pair = k + c
        if counts[pair] < maxcount:
            counts[pair] += 1
            yield from sequences(seq + c)
            counts[pair] -= 1

for i, seq in enumerate(sequences('A'), 1):
    print(i, seq)   

some output

1 ABCABCAEAEDBDBDFCFCFEDFEA
2 ABCABCAEAEDBDBDFCFEDFCFEA
3 ABCABCAEAEDBDBDFEDFCFCFEA
4 ABCABCAEAEDBDFCFCFEDBDFEA
5 ABCABCAEAEDBDFCFEDBDFCFEA
6 ABCABCAEAEDBDFEDBDFCFCFEA
7 ABCABCAEAEDFCFCFEDBDBDFEA
8 ABCABCAEAEDFCFEDBDBDFCFEA
9 ABCABCAEAEDFEDBDBDFCFCFEA
10 ABCABCAEDBDBDFCFCFEAEDFEA

To get a quick count of the sequences without printing them all, you can do

print(sum(1 for _ in sequences('A')))

That runs in under 2 seconds on my old 2GHz 32 bit machine, running Python 3.6.0


Note that increasing maxcount considerably increases the number of sequences. Eg, for total_len = 19, maxcount = 2, the total number of sequences is 20622, for total_len = 19, maxcount = 3, the total number of sequences is 169192.

Upvotes: 2

Related Questions