SherMM
SherMM

Reputation: 619

Python - build new string of specific length with n replacements from specific alphabet

I have been working on a fast, efficient way to solve the following problem, but as of yet, I have only been able to solve it using a rather slow, nest-loop solution. Anyways, here is the description:

So I have a string of length L, lets say 'BBBX'. I want to find all possible strings of length L, starting from 'BBBX', that differ at, at most, 2 positions and, at least, 0 positions. On top of that, when building the new strings, new characters must be selected from a specific alphabet.

I guess the size of the alphabet doesn't matter, so lets say in this case the alphabet is ['B', 'G', 'C', 'X'].

So, some sample output would be, 'BGBG', 'BGBC', 'BBGX', etc. For this example with a string of length 4 with up to 2 substitutions, my algorithm finds 67 possible new strings.

I have been trying to use itertools to solve this problem, but I am having a bit of difficulty finding a solution. I try to use itertools.combinations(range(4), 2) to find all the possible positions. I am then thinking of using product() from itertools to build all of the possibilities, but I am not sure if there is a way I could connect it somehow to the indices from the output of combinations().

Upvotes: 4

Views: 249

Answers (2)

Tim Peters
Tim Peters

Reputation: 70715

Not tested much, but it does find 67 for the example you gave. The easy way to connect the indices to the products is via zip():

def sub(s, alphabet, minsubs, maxsubs):
    from itertools import combinations, product
    origs = list(s)
    alphabet = set(alphabet)
    for nsubs in range(minsubs, maxsubs + 1):
        for ix in combinations(range(len(s)), nsubs):
            prods = [alphabet - set(origs[i]) for i in ix]
            s = origs[:]
            for newchars in product(*prods):
                for i, char in zip(ix, newchars):
                    s[i] = char
                yield "".join(s)

count = 0
for s in sub('BBBX', 'BGCX', 0, 2):
    count += 1
    print s
print count

Note: the major difference from FogleBird's is that I posted first - LOL ;-) The algorithms are very similar. Mine constructs the inputs to product() so that no substitution of a letter for itself is ever attempted; FogleBird's allows "identity" substitutions, but counts how many valid substitutions are made and then throws the result away if any identity substitutions occurred. On longer words and a larger number of substitutions, that can be much slower (potentially the difference between len(alphabet)**nsubs and (len(alphabet)-1)**nsubs times around the ... in product(): loop).

Upvotes: 1

FogleBird
FogleBird

Reputation: 76822

Here's my solution.

The first for loop tells us how many replacements we will perform. (0, 1 or 2 - we go through each)

The second loop tells us which letters we will change (by their indexes).

The third loop goes through all of the possible letter changes for those indexes. There's some logic to make sure we actually change the letter (changing "C" to "C" doesn't count).

import itertools

def generate_replacements(lo, hi, alphabet, text):
    for count in range(lo, hi + 1):
        for indexes in itertools.combinations(range(len(text)), count):
            for letters in itertools.product(alphabet, repeat=count):
                new_text = list(text)
                actual_count = 0
                for index, letter in zip(indexes, letters):
                    if new_text[index] == letter:
                        continue
                    new_text[index] = letter
                    actual_count += 1
                if actual_count == count:
                    yield ''.join(new_text)

for text in generate_replacements(0, 2, 'BGCX', 'BBBX'):
    print text

Here's its output:

BBBX GBBX CBBX XBBX BGBX BCBX BXBX BBGX BBCX BBXX BBBB BBBG BBBC GGBX
GCBX GXBX CGBX CCBX CXBX XGBX XCBX XXBX GBGX GBCX GBXX CBGX CBCX CBXX
XBGX XBCX XBXX GBBB GBBG GBBC CBBB CBBG CBBC XBBB XBBG XBBC BGGX BGCX
BGXX BCGX BCCX BCXX BXGX BXCX BXXX BGBB BGBG BGBC BCBB BCBG BCBC BXBB
BXBG BXBC BBGB BBGG BBGC BBCB BBCG BBCC BBXB BBXG BBXC

Upvotes: 2

Related Questions