Ajith

Reputation: 1545

Python Performance of Permutation operation

I have a data frame that contains 2000 words. I am trying to create permutations of all the words, taken 4 at a time.

   import itertools as it
   import hashlib

   perms = it.permutations(words.col, 4)
   done = object()
   nextperm = next(perms, done)
   while nextperm is not done:
       # perform operations...
       permutedString = ' '.join(nextperm)
       if (permutedString.count('a') == 2 and permutedString.count('b') == 3 and
               permutedString.count('c') == 1):
           # calculate the MD5 hash of the matching phrase
           digest = hashlib.md5(permutedString.encode('utf-8')).hexdigest()
       nextperm = next(perms, done)

The above script is taking too long; it's been running for hours. Is there any way to improve the performance of this?

Any help on this is highly appreciated.

Upvotes: 0

Views: 202

Answers (2)

DSM

Reputation: 353499

As tzaman has pointed out, what's going on inside your loop is key to coming up with a better way to address the problem. The fundamental idea is that you want to minimize the amount of pairwise (or N-wise) operations you need to perform.

In your case, you're apparently trying to select phrases with the right number of specific letters, trying to crack some kind of "correct horse battery staple" password scheme with the benefit of constraints on the letter counts. In that case, since you're only allowed 1 letter 'c', there's no point in dealing with any permutation which has two 'c's. Etc.
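
In code, that pruning might look something like the following sketch, using the question's letter targets (the word list here is stand-in data):

# Any single word that already exceeds a target letter count can never
# appear in a matching 4-word phrase, so it can be dropped up front.
words = ["banana", "abacus", "cabbage", "drum", "bulb"]  # stand-in data
max_counts = {"a": 2, "b": 3, "c": 1}  # the question's targets
usable = [w for w in words
          if all(w.count(letter) <= cap for letter, cap in max_counts.items())]
print(usable)  # drops "banana": three a's can never fit under a target of 2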

We can do better than just ruling out useless words, too: we don't actually need to compare all words to see if they match; we can simply compare word count sets. That is, we can group all the words by their a, b, and c letter counts, which we can do in linear time, and then we can just iterate over choices of four of those count sets to see if they sum to the right number of letters. Now we only have to do the count logic for elements drawn from a set of ~10s, not ~2000. (Really we could do even better than this, because we could either recursively or using partition techniques directly build the appropriate possible count sets, but let's start simple; a sketch of that direct construction appears at the end of this answer.)

Now, you've said "There will be only one or two strings that will match this condition", and I'm going to take you at your word, and limit the amount of optimization I'm going to do to handle a case like that.

If there are only a handful which satisfy the constraint, there must be only a few letter count groups which satisfy the constraint too, and not many words in those groups. So something like this should work:

from collections import Counter
from itertools import permutations, product, combinations_with_replacement
import hashlib

# make a fake set of words
with open('/usr/share/dict/words') as fp:
    words = [word.lower() for word in fp.read().split()]
words = set(words[::len(words)//2000][:2000])

# set the target to something which has <2000 matching 4-words
target_counts = Counter({"a": 5, "b": 4, "d": 8})

# collect the words by counts
by_count = {}
for word in words:
    lcount = {letter: word.count(letter) for letter in target_counts}
    by_count.setdefault(tuple(sorted(lcount.items())), []).append(word)

collected_hashes = {}
# loop over every possible collection of word count groups
for i, groups in enumerate(combinations_with_replacement(by_count, 4)):
    if i % 10000 == 0:
        print(i, groups)

    # check to see whether the letter set sums appropriately
    total_count = sum((Counter(dict(group)) for group in groups), Counter())
    if total_count != target_counts:
        continue

    # the sums are right! loop over every word draw; for simplicity
    # we won't worry about duplicate word draws, we'll just skip
    # them if we see them
    for choices in product(*(by_count[group] for group in groups)):
        if len(set(choices)) != len(choices):
            # skip duplicate words
            continue
        for perm in permutations(choices):
            joined = ' '.join(perm)
            hashed = hashlib.md5(joined.encode("utf-8")).hexdigest()
            collected_hashes.setdefault(hashed, set()).add(joined)

which takes about 10 seconds to produce a dictionary of hashes, e.g.

In [25]: len(collected_hashes)
Out[25]: 960

In [26]: collected_hashes
Out[26]: 
{'67a0da67d938a6090beedcb849f66596': {'barbed badlands saddlebag skidded'},
 '8da55149bf66f89f27b658a7ef7d7126': {'barbed badlands skidded saddlebag'},
 '69dd0f1c7af2e9973fedcc585af15df4': {'barbed saddlebag badlands skidded'},
 '106a366ef772a5f68ebb4800b2281a09': {'barbed saddlebag skidded badlands'},
...

where each password indeed has the right number of target letter counts:

In [30]: c = Counter('barbed badlands saddlebag skidded')

In [31]: c['a'], c['b'], c['d']
Out[31]: (5, 4, 8)
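
As mentioned above, we could also build the fitting count sets directly instead of filtering every combination. Here's a rough sketch of that idea; fitting_groups is a hypothetical helper, not part of the code above, and it assumes the same by_count and target_counts:

from collections import Counter

def fitting_groups(group_keys, remaining, slots, start=0):
    # Recursively pick count-group keys (with replacement) whose letter
    # counts sum exactly to `remaining` across `slots` picks.
    if slots == 0:
        if not remaining:  # every target letter fully consumed
            yield ()
        return
    for i in range(start, len(group_keys)):
        counts = Counter(dict(group_keys[i]))
        # only descend into groups that still fit under the target
        if all(counts[letter] <= remaining[letter] for letter in counts):
            for rest in fitting_groups(group_keys, remaining - counts,
                                       slots - 1, i):
                yield (group_keys[i],) + rest

# usage: replaces the combinations_with_replacement + sum check above
for groups in fitting_groups(sorted(by_count), Counter(target_counts), 4):
    ...  # same word-draw / permutation / hashing logic as before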

Upvotes: 2

tzaman

Reputation: 47850

Outside of the fact that P(2000, 4) is an enormous number (2000 · 1999 · 1998 · 1997 ≈ 1.6 × 10^13 orderings), you don't need to check your loop manually with a sentinel; you can just iterate directly over the perms object. That's about as much optimization as can be expected on the code you've provided; whatever is happening inside the loop will be the determining factor.

perms = it.permutations(words.col, 4)
for perm in perms:
    # do stuff with perm
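
Concretely, the question's filter and hashing can be inlined into that pattern (a sketch; words.col is the question's word column):

import hashlib
import itertools as it

for perm in it.permutations(words.col, 4):
    phrase = ' '.join(perm)
    # the question's letter-count filter
    if phrase.count('a') == 2 and phrase.count('b') == 3 and phrase.count('c') == 1:
        digest = hashlib.md5(phrase.encode('utf-8')).hexdigest()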

Upvotes: 0
