meld24
meld24

Reputation: 27

Itertools stop characters repeating in a row

I wrote the following code to make all strings of 20 characters long with combinations of A, T, G and C.

However, I would like to avoid more than 3 of the same characters in a row, so I added an if function to check for this. The trouble is, this is after the itertools code, so it's a bit slow. I wonder whether there is a way to use itertools to produce this result without having to run itertools and then an if function?

import sys
import itertools
import re

x = ["A","T","G","C"]
for i in itertools.product(x, repeat=20):
        i = "".join(i)
        if re.search(r"(\w)\1\1",i):
                continue
        else:
                sys.stdout.write(i)

Upvotes: 1

Views: 166

Answers (1)

Gareth McCaughan
Gareth McCaughan

Reputation: 19981

At face value, the question appears to be asking this:

How can I filter this enormous list of strings without the pain of having to construct the whole list first?

The answer to that is: you're already doing it! The things in itertools produce lazily-generated sequences that are constructed iteratively. So your existing code is not producing an enormous list of billions of strings.

But there's a possibly more interesting question you might be wanting to ask:

If I generate the triplet-free strings by generating all the strings and filtering out the ones with triplets in, my code is having to do extra work because most of the strings generated will have triplets in them. Suppose the strings are generated in lexicographic order; then the first 4**17 of them will begin AAA, and we really ought to be able to skip over all of those. How can we do better?

Unfortunately, if you want to do this then you will have to write your own code to do it; itertools doesn't offer this sort of "pattern-filtered product" functionality.

It might look something like this:

# generate all n-tuples with the property that their k-th element
# is one of the things returned by successors(initial (k-1)-tuple).
# So e.g. the first element is one of the things returned by
# successors(()).
def restricted_tuples(successors, n):
    assert(n>=0)
    if n==0:
        for t in successors(()): yield (t,)
    else:
        for start in restricted_tuples(successors, n-1):
            for t in successors(start): yield start+(t,)

def successors_no_triples(start, alphabet):
    if len(start)<2 or start[-1] != start[-2]:
        for t in alphabet: yield t
    else:
        banned = start[-1]
        for t in alphabet:
            if t != banned: yield t

print([''.join(x) for x in restricted_tuples(lambda start: successors_no_triples(start,'ABC'), 5)])

The print at the end is illustrative only. If you wanted to print out all the billions of strings from the original questioner's case, you'd do better to iterate over the sequence generated by restricted_tuples and stringify-and-print each one separately.

Incidentally, the number of length-20 sequences on 4 letters with this property turns out to be 415,289,569,968. You'll be waiting a while if you try to generate them all, especially if you actually want to do anything with each one.

Upvotes: 2

Related Questions