Reputation: 321
I would like to get all possible combinations of the letters in a string with length k. I know there are many posts on this but I have a little twist k is larger than the length of the string.
This is what I have so far, its simple and works if k <= len(string):
string = 'ABCD'
permutations = ["".join(x) for x in itertools.permutations(string, k)]
results if k = 4:
['ABCD', 'ABDC', 'ACBD', 'ACDB', 'ADBC', 'ADCB', 'BACD', 'BADC', 'BCAD', 'BCDA',
'BDAC','BDCA', 'CABD', 'CADB', 'CBAD', 'CBDA', 'CDAB', 'CDBA', 'DABC', 'DACB',
'DBAC', 'DBCA', 'DCAB', 'DCBA']
This works as expected. However, I would like all possible combinations of these four letters with k > len(string).
An example answer I would like would be:
string = 'AB'
k = 4
result = ['AAA,'ABB','AAB', 'ABA','BBB', 'BAA'.......]
Thanks in advance.
Upvotes: 1
Views: 10086
Reputation: 279265
Based on your comment:
I'm trying to search a very large string for the number of occurrences of each combination and see which combination occurs the most often.
There's another way to do what you want:
def substrings(vlarge, k):
return (vlarge[idx:idx+k] for idx in range(len(vlarge)-k+1))
def uses_only(value, chars):
return all(ch in chars for ch in value)
def most_common(vlarge, chars, k):
return collections.Counter(s for s in substrings(vlarge, k) if uses_only(s, chars)).most_common(1)
You can then look at making this basic idea more efficient: for example if you encounter an 'x'
character in vlarge
then you know that none of the substrings that include it are combinations of 'abcd'
. So you can skip ahead to the substring that starts one place after the x
:
def generate_substrings(vlarge, chars, k):
idx = 0
goodrange = 0
while idx <= len(vlarge) - k:
while goodrange < idx + k:
if vlarge[goodrange] in chars:
goodrange += 1
else:
idx = goodrange + 1
if idx > len(vlarge) - k:
return
goodrange = idx
yield vlarge[idx:goodrange]
idx += 1
def most_common(vlarge, chars, k):
return collections.Counter(generate_substrings(vlarge, chars, k)).most_common(1)
Compared with this approach, the "obvious" idea (iterate over all combinations, counting how many times they appear as a substring, and keeping track of the best so far) uses less memory but will be a lot slower, since it has to make a lot of passes over the very large string.
If I have misunderstood what you mean by "combinations", that is to say if my uses_only
function is incorrect, then you'd have to adjust my idea accordingly. The point is: count the actual substrings of the form you want, because there are fewer of them than there are hypothetical substrings of the correct form.
Upvotes: 3
Reputation: 6086
My answer will only be a theoretical analysis of what you are doing. I will denote the binomial coefficient defining the number of parts containing k elements of a set containing n elements by C(k,n).
Suppose you have a string of length n ∈ ℕ* and k ∈ ℕ, k ⩾ n. I will assume all the characters in your string are distinct.
I understood that you are trying to build a string of k characters extracted from your input string.
A combination of your string characters can be seen as a permutation of ⟦1, n⟧. There are n! such permutations ...
Then, when k > n, things are getting much worse ... Let r = k mod n and p = (k - r)/n. Obviously we have :
p ⩾ 1
0 ⩽ r < p
Your output string can there be decomposed in p "complete" substrings made out of a permutation of your n input characters and one substring made out of only r characters of your input string.
To build such an "incomplete" substring, you first have to choose a subset of r characters of your input string and then a permutation of such characters. Finally, the number sr,n of such possible substrings is :
sr,n = C(r,n).r!
Note that this formula won't lead to an invalid global result when r = 0, since C(0,n) = 1 and 0! = 1 by convention.
The final number of k-long strings you can build according to your scheme is :
stot = (C(r,n).r!).(n!)p
This number is outrageously high !
With k = 4 and n = 2, we have :
stot = (C(0,4).0!).(2!)2 = 4
result = ['ABAB', 'ABBA', 'BAAB', 'BABA']
Upvotes: 1
Reputation: 70602
You may well want
itertools.product(string, repeat=k)
instead. Try it! Your description is ill-defined, so can't guess for sure.
Example:
>>> import itertools
>>> for p in itertools.product("ab", repeat=3):
... print p
('a', 'a', 'a')
('a', 'a', 'b')
('a', 'b', 'a')
('a', 'b', 'b')
('b', 'a', 'a')
('b', 'a', 'b')
('b', 'b', 'a')
('b', 'b', 'b')
Upvotes: 9