Reputation: 305
I'm trying to write a script to calculate all of the possible fuzzy string match matches to for a short string, or 'kmer', and the same code that works in Python 2.7.X gives me a non-deterministic answer with Python 3.3.X, and I can't figure out why.
I iterate over a dictionary, itertools.product, and itertools.combinations in my code, but I iterate over all of them to completion with no breaks or continues. In addition, I store all of my results in a separate dictionary instead of the one I'm iterating over. In short - I'm not making any mistakes that are obvious to me, so why is the behavior different between Python2 and Python3?
Sample, slightly simplified code below:
import itertools
def find_best_fuzzy_kmer( kmers ):
for kmer, value in kmers.items():
for similar_kmer in permute_string( kmer, m ):
# Tabulate Kmer
def permute_string( query, m ):
query_list = list(query)
output = set() # hold output
for i in range(m+1):
# pre-calculate the possible combinations of new bases
base_combinations = list(itertools.product('AGCT', repeat=i))
# for each combination `idx` in idxs, replace str[idx]
for positions in itertools.combinations(range(len(query_list)), i):
for bases in base_combinations:
# Generate Permutations and add to output
return output
Upvotes: 22
Views: 2134
Reputation: 70725
If by "non-deterministic" you mean the order in which dictionary keys appear (when you iterate over a dictionary) changes from run to run, and the dictionary keys are strings, please say so. Then I can help. But so far you haven't said any of that ;-)
Assuming that's the problem, here's a little program:
d = dict((L, i) for i, L in enumerate('abcd'))
print(d)
and the output from 4 runs under Python 3.3.2:
{'d': 3, 'a': 0, 'c': 2, 'b': 1}
{'d': 3, 'b': 1, 'c': 2, 'a': 0}
{'d': 3, 'a': 0, 'b': 1, 'c': 2}
{'a': 0, 'b': 1, 'c': 2, 'd': 3}
The cause is hinted at from this part of python -h
output:
Other environment variables:
...
PYTHONHASHSEED: if this variable is set to 'random', a random value is used
to seed the hashes of str, bytes and datetime objects. It can also be
set to an integer in the range [0,4294967295] to get hash values with a
predictable seed.
This is a half-baked "security fix", intended to help prevent DOS attacks based on constructing dict inputs designed to provoke quadratic-time behavior. "random" is the default in Python3.
You can turn that off by setting the envar PYTHONHASHSEED to an integer (your choice - pick 0 if you don't care). Then iterating a dict with string keys will produce them in the same order across runs.
As @AlcariTheMad said in a comment, you can enable the Python3 default behavior under Python 2 via python -R ...
.
Upvotes: 35