Reputation: 321

all possible length k combinations of a string in python

I would like to get all possible combinations of the letters in a string with length k. I know there are many posts on this but I have a little twist k is larger than the length of the string.

This is what I have so far, its simple and works if k <= len(string):

 string = 'ABCD'
 permutations = ["".join(x) for x in itertools.permutations(string, k)]

results if k = 4:

 ['ABCD', 'ABDC', 'ACBD', 'ACDB', 'ADBC', 'ADCB', 'BACD', 'BADC', 'BCAD', 'BCDA', 
'BDAC','BDCA', 'CABD', 'CADB', 'CBAD', 'CBDA', 'CDAB', 'CDBA', 'DABC', 'DACB', 
'DBAC', 'DBCA', 'DCAB', 'DCBA']

This works as expected. However, I would like all possible combinations of these four letters with k > len(string).

An example answer I would like would be:

string = 'AB'
k = 4
result = ['AAA,'ABB','AAB', 'ABA','BBB', 'BAA'.......]

Thanks in advance.

Upvotes: 1

Answers (3)

Steve Jessop

Reputation: 279265

Based on your comment:

I'm trying to search a very large string for the number of occurrences of each combination and see which combination occurs the most often.

There's another way to do what you want:

def substrings(vlarge, k):
    return (vlarge[idx:idx+k] for idx in range(len(vlarge)-k+1))

def uses_only(value, chars):
    return all(ch in chars for ch in value)

def most_common(vlarge, chars, k):
    return collections.Counter(s for s in substrings(vlarge, k) if uses_only(s, chars)).most_common(1)

You can then look at making this basic idea more efficient: for example if you encounter an 'x' character in vlarge then you know that none of the substrings that include it are combinations of 'abcd'. So you can skip ahead to the substring that starts one place after the x:

def generate_substrings(vlarge, chars, k):
    idx = 0
    goodrange = 0
    while idx <= len(vlarge) - k:
        while goodrange < idx + k:
            if vlarge[goodrange] in chars:
                goodrange += 1
            else:
                idx = goodrange + 1
                if idx > len(vlarge) - k:
                    return
                goodrange = idx
        yield vlarge[idx:goodrange]
        idx += 1

def most_common(vlarge, chars, k):
    return collections.Counter(generate_substrings(vlarge, chars, k)).most_common(1)

Compared with this approach, the "obvious" idea (iterate over all combinations, counting how many times they appear as a substring, and keeping track of the best so far) uses less memory but will be a lot slower, since it has to make a lot of passes over the very large string.

If I have misunderstood what you mean by "combinations", that is to say if my uses_only function is incorrect, then you'd have to adjust my idea accordingly. The point is: count the actual substrings of the form you want, because there are fewer of them than there are hypothetical substrings of the correct form.

Upvotes: 3

Rerito

Reputation: 6086

My answer will only be a theoretical analysis of what you are doing. I will denote the binomial coefficient defining the number of parts containing k elements of a set containing n elements by C(k,n).

Suppose you have a string of length n ∈ ℕ^* and k ∈ ℕ, k ⩾ n. I will assume all the characters in your string are distinct.
I understood that you are trying to build a string of k characters extracted from your input string.

A combination of your string characters can be seen as a permutation of ⟦1, n⟧. There are n! such permutations ...

Then, when k > n, things are getting much worse ... Let r = k mod n and p = (k - r)/n. Obviously we have :

p ⩾ 1
0 ⩽ r < p

Your output string can there be decomposed in p "complete" substrings made out of a permutation of your n input characters and one substring made out of only r characters of your input string.

To build such an "incomplete" substring, you first have to choose a subset of r characters of your input string and then a permutation of such characters. Finally, the number s_r,n of such possible substrings is :

s_r,n = C(r,n).r!

Note that this formula won't lead to an invalid global result when r = 0, since C(0,n) = 1 and 0! = 1 by convention.

The final number of k-long strings you can build according to your scheme is :

s_tot = (C(r,n).r!).(n!)^p

This number is outrageously high !

With k = 4 and n = 2, we have :

s_tot = (C(0,4).0!).(2!)² = 4

result = ['ABAB', 'ABBA', 'BAAB', 'BABA']

Upvotes: 1

Tim Peters

Reputation: 70602

You may well want

itertools.product(string, repeat=k)

instead. Try it! Your description is ill-defined, so can't guess for sure.

Example:

>>> import itertools
>>> for p in itertools.product("ab", repeat=3):
...     print p
('a', 'a', 'a')
('a', 'a', 'b')
('a', 'b', 'a')
('a', 'b', 'b')
('b', 'a', 'a')
('b', 'a', 'b')
('b', 'b', 'a')
('b', 'b', 'b')

Upvotes: 9

all possible length k combinations of a string in python

Answers (3)

Related Questions