Scott Bartell
Scott Bartell

Reputation: 2840

Is it faster to loop through a Python set of number or a set of letters?

Is it faster to loop through a Python set of numbers or a Python set of letters given that each set is the exact same length and each item within each set is the same length? Why?

I would think that there would be a difference because letters have more possible characters [a-zA-Z] than numbers [0-9] and therefor would be more 'random' and likely affect the hashing to some extent.

numbers = set([00000,00001,00002,00003,00004,00005, ... 99999])

letters = set(['aaaaa','aaaab','aaaac','aaaad', ... 'aaabZZ']) # this is just an example, it does not actually end here

for item in numbers:
  do_something()

for item in letters:
  do_something()

where len(numbers) == len(letters)

Update: I am interested in Python's specific hashing algorithm and what happens behind the scenes with this implementation.

Upvotes: 1

Views: 367

Answers (2)

Steve Jessop
Steve Jessop

Reputation: 279225

There might be some particular implementation details of Python that I'm not aware of that mess with my general arguments here, but:

  • Creating the set of strings will likely be a little slower than creating the set of integers (all else being equal), since the hash operation on strings takes some (small) time to run while the hash operation on integers is trivial.
  • Iterating a set doesn't perform any hash operations, so the time to hash is irrelevant there.
  • Iterating a set depends on the number of elements in the set and the number of buckets in the hash table backing the set. So the distribution of the hash function only matters if it causes the hash table to increase the bucket count. For some hash table implementations that's impossible (because bucket count is only increased when the load factor exceeds a threshold, not just because of collisions). Other hash table implementations resize when there are a lot of collisions. I don't know which CPython is.
  • Anyway, the particular example you give of a set of integers will generate well-distributed hash values.
  • There's one way to find out which is faster in Python, which is to timeit, with a realistic example of the data you care about. Speculation is usually a waste of time.

You can see the results of Python's hashing algorithms like this:

>>> foo = 3
>>> foo.__hash__()
3
>>> foo = 1856348
>>> foo.__hash__()
1856348
>>> foo = "\x00"
>>> foo.__hash__()
1
>>> foo = "\x01"
>>> foo.__hash__()
128000384
>>> foo = "\x02"
>>> foo.__hash__()
256000771

So on my copy of Python, those hash results match these reported Python hash algorithms. As ever with CPython, you can look at the source to confirm the algorithm.

Upvotes: 7

Nolen Royalty
Nolen Royalty

Reputation: 18633

You can't know until you profile! Here's some crude data:

% python -mtimeit -s 'import string;import random;s=set("".join(random.sample(string.printable, 5))for _ in range(10000))' 'for foo in s: bar=foo'
1000 loops, best of 3: 376 usec per loop

% python -mtimeit -s 'import random;s=set(int("".join(map(str,random.sample(range(10),5))))for _ in range(10000))' 'for foo in s: bar=foo'                     
1000 loops, best of 3: 322 usec per loop

It looks like you're correct that there is a slight difference, but you'd need to do a lot more testing to say much with confidence.

Upvotes: 2

Related Questions