Rajan

Reputation: 43

Permutations with Order

I am trying to write a Python function that behaves similarly to itertools.permutations.

import itertools
for s in itertools.permutations("TCGA****"):
    print(s)

The ideal output from such a function would be

('*','*','*','*','T', 'C','G','A')
('*','*','*','T','*', 'C','G','A')
('*','*','*','T','C', '*','G','A')
('*','*','*','T','C', 'G','*','A')
('*','*','*','T','C', 'G','A','*')
('*','*','T','C','G', 'A','*','*')
('*','*','T','C','G', '*','*','A')
('*','*','T','C','*', '*','G','A')
...
('T', 'C','G','A','*','*','*','*')

The only difference between itertools.permutations and this function is that the order of the letters is maintained, i.e. 'T' always precedes 'C', which precedes 'G', which precedes 'A'.

The following is an example that violates this rule

('*','*','T','*','G','C','A','*')

The order of 'C' and 'G' has changed.

How can I produce the permutations for which the order 'TCGA' is maintained among the asterisks?

Upvotes: 4

Views: 1628

Answers (2)

miradulo

Reputation: 29690

One idea would be to produce all the possible index sets for your '*' values with itertools.combinations over the index range, and then construct each permutation from those indices, filling the positions not in the combination with your 'TCGA' values in order.

Since every row uses all of TCGA exactly once, itertools.cycle is one way to keep getting the appropriate value for the next position. Here perms is implemented as a generator to allow for lazy evaluation.

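from itertools import combinations, cycle

# Endless T, C, G, A, T, C, ... stream; each row consumes exactly four values,
# so every row starts again at 'T'.
char_cyc = cycle('TCGA')
# Every way to choose the 4 positions (out of 8) that hold '*'.
combos = combinations(range(8), 4)

# Lazily build one row per combination.
perms = (['*' if i in combo else next(char_cyc) for i in range(8)]
         for combo in combos)

print(list(perms))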

Outputs:

[['*', '*', '*', '*', 'T', 'C', 'G', 'A'], ['*', '*', '*', 'T', '*', 'C', 'G', 'A'], ['*', '*', '*', 'T', 'C', '*', 'G', 'A'], ['*', '*', '*', 'T', 'C', 'G', '*', 'A'], ['*', '*', '*', 'T', 'C', 'G', 'A', '*'], ['*', '*', 'T', '*', '*', 'C', 'G', 'A'], ['*', '*', 'T', '*', 'C', '*', 'G', 'A'], ['*', '*', 'T', '*', 'C', 'G', '*', 'A'], ['*', '*', 'T', '*', 'C', 'G', 'A', '*'], ['*', '*', 'T', 'C', '*', '*', 'G', 'A'], ['*', '*', 'T', 'C', '*', 'G', '*', 'A'], ['*', '*', 'T', 'C', '*', 'G', 'A', '*'], ['*', '*', 'T', 'C', 'G', '*', '*', 'A'], ['*', '*', 'T', 'C', 'G', '*', 'A', '*'], ['*', '*', 'T', 'C', 'G', 'A', '*', '*'], ['*', 'T', '*', '*', '*', 'C', 'G', 'A'], ['*', 'T', '*', '*', 'C', '*', 'G', 'A'], ['*', 'T', '*', '*', 'C', 'G', '*', 'A'], ['*', 'T', '*', '*', 'C', 'G', 'A', '*'], ['*', 'T', '*', 'C', '*', '*', 'G', 'A'], ['*', 'T', '*', 'C', '*', 'G', '*', 'A'], ['*', 'T', '*', 'C', '*', 'G', 'A', '*'], ['*', 'T', '*', 'C', 'G', '*', '*', 'A'], ['*', 'T', '*', 'C', 'G', '*', 'A', '*'], ['*', 'T', '*', 'C', 'G', 'A', '*', '*'], ['*', 'T', 'C', '*', '*', '*', 'G', 'A'], ['*', 'T', 'C', '*', '*', 'G', '*', 'A'], ['*', 'T', 'C', '*', '*', 'G', 'A', '*'], ['*', 'T', 'C', '*', 'G', '*', '*', 'A'], ['*', 'T', 'C', '*', 'G', '*', 'A', '*'], ['*', 'T', 'C', '*', 'G', 'A', '*', '*'], ['*', 'T', 'C', 'G', '*', '*', '*', 'A'], ['*', 'T', 'C', 'G', '*', '*', 'A', '*'], ['*', 'T', 'C', 'G', '*', 'A', '*', '*'], ['*', 'T', 'C', 'G', 'A', '*', '*', '*'], ['T', '*', '*', '*', '*', 'C', 'G', 'A'], ['T', '*', '*', '*', 'C', '*', 'G', 'A'], ['T', '*', '*', '*', 'C', 'G', '*', 'A'], ['T', '*', '*', '*', 'C', 'G', 'A', '*'], ['T', '*', '*', 'C', '*', '*', 'G', 'A'], ['T', '*', '*', 'C', '*', 'G', '*', 'A'], ['T', '*', '*', 'C', '*', 'G', 'A', '*'], ['T', '*', '*', 'C', 'G', '*', '*', 'A'], ['T', '*', '*', 'C', 'G', '*', 'A', '*'], ['T', '*', '*', 'C', 'G', 'A', '*', '*'], ['T', '*', 'C', '*', '*', '*', 'G', 'A'], ['T', '*', 'C', '*', '*', 'G', '*', 'A'], ['T', '*', 'C', '*', 
'*', 'G', 'A', '*'], ['T', '*', 'C', '*', 'G', '*', '*', 'A'], ['T', '*', 'C', '*', 'G', '*', 'A', '*'], ['T', '*', 'C', '*', 'G', 'A', '*', '*'], ['T', '*', 'C', 'G', '*', '*', '*', 'A'], ['T', '*', 'C', 'G', '*', '*', 'A', '*'], ['T', '*', 'C', 'G', '*', 'A', '*', '*'], ['T', '*', 'C', 'G', 'A', '*', '*', '*'], ['T', 'C', '*', '*', '*', '*', 'G', 'A'], ['T', 'C', '*', '*', '*', 'G', '*', 'A'], ['T', 'C', '*', '*', '*', 'G', 'A', '*'], ['T', 'C', '*', '*', 'G', '*', '*', 'A'], ['T', 'C', '*', '*', 'G', '*', 'A', '*'], ['T', 'C', '*', '*', 'G', 'A', '*', '*'], ['T', 'C', '*', 'G', '*', '*', '*', 'A'], ['T', 'C', '*', 'G', '*', '*', 'A', '*'], ['T', 'C', '*', 'G', '*', 'A', '*', '*'], ['T', 'C', '*', 'G', 'A', '*', '*', '*'], ['T', 'C', 'G', '*', '*', '*', '*', 'A'], ['T', 'C', 'G', '*', '*', '*', 'A', '*'], ['T', 'C', 'G', '*', '*', 'A', '*', '*'], ['T', 'C', 'G', '*', 'A', '*', '*', '*'], ['T', 'C', 'G', 'A', '*', '*', '*', '*']]

A good indication that the output is correct is that it contains 70 permutations, which is equal to 8C4 (or "8 choose 4"): choosing which 4 of the 8 positions hold the asterisks is effectively what your problem comes down to.
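If you want this for other letters or a different number of asterisks, the same idea can be wrapped in a function. This is only a sketch: the name ordered_placements and its parameters are made up for illustration, it chooses the letter positions rather than the asterisk positions (so rows come out in a different order than above), and math.comb needs Python 3.8 or later.

```python
from itertools import combinations
from math import comb


def ordered_placements(letters="TCGA", n_fill=4, filler="*"):
    """Yield every arrangement that keeps `letters` in their given order.

    Hypothetical helper: name and signature are illustrative only.
    """
    total = len(letters) + n_fill
    # Choose which positions hold the letters; combinations() yields the
    # positions in increasing order, so the letters keep their order.
    for positions in combinations(range(total), len(letters)):
        row = [filler] * total
        for pos, ch in zip(positions, letters):
            row[pos] = ch
        yield tuple(row)


if __name__ == "__main__":
    rows = list(ordered_placements())
    # The count should match "8 choose 4".
    print(len(rows), comb(8, 4))
```

Because it is a generator, you can also iterate over it directly instead of building the full list.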

Upvotes: 6

julienc

Reputation: 20305

My solution is much less efficient than Mitch's, but it is another way to solve the problem, so it might interest you as well.

Here is my approach: generate all the possible permutations of "****XXXX" (40320 exactly, i.e. 8!), then, for each resulting permutation, replace each "X" with the corresponding value from "TCGA" in the wanted order. The flaw here is that there won't be 40320 distinct patterns, but only 70, which means:

  • we'll have to execute the "for" loop 40320 times when 70 would have been enough
  • we'll have to store the generated permutations in order to ignore the duplicates

But as I said, it's another way of seeing the problem.

>>> import itertools
>>> already_seen_permutations = set()
>>> for s in itertools.permutations("****XXXX"):
...     if s in already_seen_permutations:
...         continue  # duplicate permutation, just ignore it
...     already_seen_permutations.add(s)
...     # time to insert TCGA correctly
...     s = tuple("".join(s).replace("X", "T", 1).replace("X", "C", 1).replace("X", "G", 1).replace("X", "A", 1))
...     print(s)

On my computer, it takes roughly one second to execute the code 100 times. In terms of performance, it's approximately the same as generating all the permutations of "****TCGA" and ignoring the ones that do not follow the "TCGA" order.
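For comparison, that filtering variant could look something like the sketch below. The function name ordered_by_filtering is just something I made up for illustration, and it assumes the filler character does not appear among the letters. It still walks through all 40320 permutations and needs a set to skip duplicates, so it has the same drawbacks as the code above.

```python
from itertools import permutations


def ordered_by_filtering(letters="TCGA", filler="*", n_fill=4):
    """Yield permutations of letters + fillers that keep the letters in order.

    Hypothetical helper, brute force: assumes `filler` is not in `letters`.
    """
    seen = set()
    for p in permutations(letters + filler * n_fill):
        if p in seen:
            continue  # same pattern reached by swapping identical fillers
        seen.add(p)
        # Keep the pattern only if the non-filler characters, read left
        # to right, spell out `letters` exactly.
        if [c for c in p if c != filler] == list(letters):
            yield p


if __name__ == "__main__":
    for s in ordered_by_filtering():
        print(s)
```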

Upvotes: 1
