Jim

Reputation: 19582

Why is the hash table resized by doubling it?

Looking at the Java sources and googling online for hash table code examples, it seems that the table is resized by doubling it.
But most textbooks say that the best size for the table is a prime number.
So my question is:
Is the approach of doubling because:

  1. It is easy to implement, or
  2. Is finding a prime number too inefficient? (But I think that finding the next prime by stepping n += 2 and testing for primality using modulo is O(log log N), which is cheap; a sketch of that search follows this list.)
  3. Or is this my misunderstanding, and only certain hash table variants require a prime table size?
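
For concreteness, here is a minimal sketch (mine, not from a textbook) of the next-prime search mentioned in point 2: step over odd numbers and trial-divide each candidate by odd divisors up to sqrt(n).

    // Find the smallest prime >= n by stepping over odd numbers and
    // trial-dividing each candidate by odd divisors up to sqrt(n).
    static int nextPrime(int n) {
        if (n <= 2) return 2;
        if (n % 2 == 0) n++;                 // start from an odd number
        while (!isPrime(n)) n += 2;          // test only odd candidates
        return n;
    }

    static boolean isPrime(int n) {
        if (n % 2 == 0) return n == 2;
        for (int d = 3; (long) d * d <= n; d += 2) {
            if (n % d == 0) return false;
        }
        return n > 1;
    }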

Update:
The prime-number approach presented in textbooks is required for certain properties to hold (e.g. quadratic probing needs a prime-sized table to prove that, if the table is not full, an item X will be inserted).
The link posted as a duplicate asks generally about increasing by any amount, e.g. by 25% or to the next prime, and the accepted answer there states that we double in order to keep the resizing operation "rare", so that we can guarantee amortized time.
It does not answer the question of keeping the table size prime and resizing to a prime that is even greater than double, i.e. preserving the properties of a prime size while taking the resizing overhead into account.

Upvotes: 11

Views: 3906

Answers (2)

Persixty

Reputation: 8589

Java HashMap (java.util.HashMap) chains bucket collisions in a linked list (or, as of JDK 8, a tree, depending on the size and fill of the bins).

Consequently, theories about secondary probing functions don't apply. It seems the message 'use prime sizes for hash tables' has, over the years, become detached from the circumstances in which it applies...

Using powers of two has the advantage (as noted by other answers) that reducing the hash value to a table index can be achieved with a bit mask. Integer division is relatively expensive, and in high-performance situations this can help.
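
As a minimal illustration (not the actual HashMap code) of the two reductions:

    // Sketch: reducing a 32-bit hash value to a table index.
    static int indexPowerOfTwo(int hash, int capacity) {
        // capacity must be a power of two; the reduction is a single AND
        return hash & (capacity - 1);
    }

    static int indexPrime(int hash, int capacity) {
        // an arbitrary (e.g. prime) capacity needs an integer division;
        // floorMod keeps the result non-negative for negative hash values
        return Math.floorMod(hash, capacity);
    }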

I'm going to observe that "redistributing the collision chains when rehashing is a cinch for tables that are a power of two going to a power of two".

Notice that when using powers of two, rehashing to twice the size 'splits' each bucket between two buckets based on the 'next' bit of the hash code. That is, if the hash table had 256 buckets, and so used the lowest 8 bits of the hash value, rehashing splits each collision chain based on the 9th bit: an entry either remains in the same bucket B (the 9th bit is 0) or moves to bucket B+256 (the 9th bit is 1). Such splitting can preserve, and take advantage of, the bucket handling approach. For example, java.util.HashMap keeps small buckets sorted in reverse order of insertion and then splits them into two sub-structures obeying that order. It keeps big buckets in a binary tree sorted by hash code and similarly splits the tree to preserve that order.

NB: These tricks weren't implemented until JDK8.
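
A minimal sketch of that split (my illustration, not the HashMap source): when the capacity doubles, each entry either stays at its old index or moves up by exactly the old capacity, decided by the newly exposed bit of the hash.

    // When a power-of-two table doubles, an entry's new index is either its
    // old index (the newly exposed hash bit is 0) or old index + oldCapacity
    // (the bit is 1), so each collision chain splits cleanly into two.
    static int newIndexAfterDoubling(int hash, int oldCapacity) {
        int oldIndex = hash & (oldCapacity - 1);
        boolean movesUp = (hash & oldCapacity) != 0; // the newly exposed bit
        return movesUp ? oldIndex + oldCapacity : oldIndex;
    }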

(I am pretty sure) java.util.HashMap only sizes up (never down). But there are similar efficiencies in halving a hash table as in doubling it.

One 'downside' of this strategy is that Object implementers aren't explicitly required to make sure the low-order bits of their hash codes are well distributed. A perfectly valid hash code could be well distributed overall but poorly distributed in its low-order bits, so an object obeying the general contract for hashCode() might still tank when actually used in a HashMap! java.util.HashMap mitigates this by applying an additional hash 'spread' on top of the provided hashCode() implementation. That 'spread' is really quick and crude (it XORs the 16 high bits into the low bits).
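
For reference, the spread in the JDK 8 source linked below is essentially the following (a paraphrased sketch, not a verbatim copy):

    // Paraphrase of the JDK 8 HashMap 'spread': XOR the high 16 bits of
    // hashCode() into the low 16 bits before the index mask is applied.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }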

Object implementers should be aware (if they aren't already) that bias in their hash codes (or the lack of it) can have a significant effect on the performance of data structures that use hashes.

For the record I've based this analysis on this copy of the source:

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java

Upvotes: 3

leventov

Reputation: 15313

Q: But most textbooks say that the best size for the table is a prime number.

Regarding size primality:

Whether the size should be prime depends on the collision resolution algorithm you choose. Some algorithms require a prime table size (double hashing, quadratic hashing); others don't, and they can benefit from a power-of-2 table size, because it allows very cheap modulo operations. However, when the closest "available table sizes" differ by a factor of 2, the memory usage of the hash table might be unreliable. So, even when using linear hashing or separate chaining, you can choose a non-power-of-2 size. In that case, in turn, it's worth choosing a prime size in particular, because:

If you pick a prime table size (either because the algorithm requires it, or because you are not satisfied with the memory usage unreliability implied by a power-of-2 size), the table slot computation (modulo by the table size) can be combined with the hashing itself. See this answer for more.

The point that a power-of-2 table size is undesirable when the hash function distribution is bad (from the answer by Neil Coffey) is impractical, because even if you have a bad hash function, avalanching it and still using a power-of-2 size would be faster than switching to a prime table size: a single integer division is still slower on modern CPUs than the several multiplications and shift operations required by good avalanching functions, e.g. from MurmurHash3.
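
For reference, the MurmurHash3 32-bit finalizer ("fmix32") is one such avalanching step: only XOR-shifts and multiplications, no division.

    // MurmurHash3 32-bit finalizer (fmix32): avalanches a hash value using
    // XOR-shifts and multiplications only, so it stays cheaper than an
    // integer division by a prime table size.
    static int avalanche(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }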


Q: Also, to be honest, I got a bit lost on whether you actually recommend primes or not. It seems that it depends on the hash table variant and the quality of the hash function?

  1. The quality of the hash function doesn't matter; you can always "improve" the hash function with MurmurHash3 avalanching, which is cheaper than switching from a power-of-2 table size to a prime table size, see above.

  2. I recommend choosing a prime size, with the QHash or quadratic hashing algorithm (they aren't the same), only when you need precise control over the hash table load factor and predictably high actual loads. With a power-of-2 table size, the minimum resize factor is 2, and generally we cannot guarantee that the hash table will have an actual load factor any higher than 0.5. See this answer.

    Otherwise, I recommend going with a power-of-2-sized hash table with linear probing.

Q: Is the approach of doubling because:
It is easy to implement, or

Basically, in many cases, yes. See this large answer regarding load factors:

The load factor is not an essential part of the hash table data structure -- it is a way to define rules of behaviour for the dynamic system (a growing/shrinking hash table is a dynamic system).

Moreover, in my opinion, in 95% of modern hash table cases this approach is oversimplified; dynamic systems behave suboptimally.

What is doubling? It's just the simplest resizing strategy. The strategy could be arbitrarily complex, performing optimally for your use cases. It could consider the present hash table size, growth intensity (how many get operations were done since the previous resize), etc. Nobody forbids you from implementing such custom resizing logic.
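
As an illustration only (the interface and the heuristic below are hypothetical, not taken from any library), such a policy could look at the current capacity and recent access statistics:

    // Hypothetical sketch of custom resizing logic; the names and the
    // heuristic are invented for illustration.
    interface ResizePolicy {
        int nextCapacity(int currentCapacity, int size, long getsSinceLastResize);
    }

    class GrowthAwarePolicy implements ResizePolicy {
        @Override
        public int nextCapacity(int currentCapacity, int size, long getsSinceLastResize) {
            // Grow conservatively when the table is mostly read between resizes,
            // more aggressively when it is filling up quickly.
            boolean readHeavy = getsSinceLastResize > 4L * size;
            return readHeavy ? currentCapacity * 2 : currentCapacity * 4;
        }
    }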

Q: Is finding a prime number too inefficient (but I think that finding the next prime by stepping n += 2 and testing for primality using modulo is O(log log N), which is cheap)

It is good practice to precompute some subset of prime hash table sizes, and to choose between them using binary search at runtime. See the list of double hashing capacities and explanation, and the QHash capacities. Or even use a direct lookup, which is very fast.
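
A minimal sketch of that idea, using a few illustrative primes near powers of two rather than the linked capacity lists:

    import java.util.Arrays;

    // Sketch: precomputed prime capacities (illustrative primes close to
    // powers of two, not the linked lists), selected with a binary search.
    class PrimeCapacities {
        private static final int[] CAPACITIES = {
            53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593
        };

        static int capacityFor(int requiredSlots) {
            int i = Arrays.binarySearch(CAPACITIES, requiredSlots);
            if (i < 0) i = -i - 1;  // insertion point: first capacity >= requiredSlots
            // beyond the precomputed range, fall back to the largest entry
            return i < CAPACITIES.length ? CAPACITIES[i] : CAPACITIES[CAPACITIES.length - 1];
        }
    }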

Q: Or this is my misunderstanding and only certain hashtable variants only require prime table size?

Yes, only certain types require it; see above.

Upvotes: 6
