Reputation: 2827
In the constructor of unordered_map, we can specify the initial bucket count. I had thought I could use this to reduce the number of rehashes, but it could also hurt performance in some cases. Rehashing happens on insert when:
Rehashing occurs only if the new number of elements is greater than max_load_factor()*bucket_count(). If the insertion is successful, pointers and references to the element obtained while it is held in the node handle are invalidated, and pointers and references obtained to that element before it was extracted become valid. (since C++17)
The quote above is from the std::unordered_map documentation. I guess Boost is similar? But its documentation does not state the condition for rehashing.
If I initialize the bucket count to 100, and a single bucket ends up holding all 100 elements, rehashing would still not happen until the 101st element is inserted (with the default max_load_factor() of 1.0). If I use the default bucket count, which I assume is much smaller than 100, rehashing can happen much earlier.
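Here is a small experiment I would use to check that condition (my own sketch, not taken from either library's documentation):

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m(100);           // ask for at least 100 buckets
    const std::size_t buckets = m.bucket_count();  // may be rounded up by the implementation
    std::cout << "buckets: " << buckets
              << ", max_load_factor: " << m.max_load_factor() << '\n';  // 1.0 by default

    // Fill until size() == bucket_count(): still no rehash, because
    // rehashing requires size() > max_load_factor() * bucket_count().
    for (std::size_t i = 0; i < buckets; ++i)
        m.emplace(static_cast<int>(i), 0);
    std::cout << "after " << buckets << " inserts: " << m.bucket_count() << " buckets\n";

    // One more insert crosses the threshold and triggers a rehash.
    m.emplace(static_cast<int>(buckets), 0);
    std::cout << "after " << buckets + 1 << " inserts: " << m.bucket_count() << " buckets\n";
}
```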
If so, when do we want to initialize the bucket count?
Upvotes: 2
Views: 813
Reputation: 61569
A good rule of thumb is that a hash table should only ever be about 70% full, that is, a load factor of 0.7. This results in some collisions, but not too many.
If you know ahead of time that the number of items you plan to put in your table is N, then setting the number of buckets to (int)(N/0.7) + 1 might be a good choice, as it avoids the need to rehash. If you're experimenting with the load factor, you'd use (int)(N/load_factor) + 1 instead.
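As a sketch (the function name and parameters here are mine, not part of any API), that sizing rule looks like:

```cpp
#include <cstddef>
#include <unordered_map>

// make_presized, N and load_factor are illustrative names, not library API.
std::unordered_map<int, int> make_presized(std::size_t N, double load_factor = 0.7) {
    // Manual version of the rule of thumb: buckets = N / load_factor + 1.
    std::unordered_map<int, int> m(static_cast<std::size_t>(N / load_factor) + 1);

    // Alternatively, let the container do the math:
    //   m.max_load_factor(static_cast<float>(load_factor));
    //   m.reserve(N);  // picks a bucket count so N inserts won't rehash
    return m;
}
```

Note that std::unordered_map also provides reserve(N), which computes the required bucket count from max_load_factor() for you, so you don't have to do the division by hand.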
Making the table too large will probably not affect speed much: the cost of memory allocation does not depend heavily on how much memory you are allocating and, above a certain size, all tables will have poor cache performance for random accesses.
Upvotes: 2
Reputation: 50111
If so, when do we want to initialize the bucket count?
When profiling shows it helps.
More specific advice cannot be given, as this depends both on the exact data and on the hash function in use.
As usual, if the default is fast enough, just use that.
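For instance, a minimal micro-benchmark sketch along these lines (the helper name and the insert-only workload are illustrative; measure your real access pattern):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <unordered_map>

// fill_ms is an illustrative helper, not a standard function.
static double fill_ms(std::size_t n, std::size_t initial_buckets) {
    const auto t0 = std::chrono::steady_clock::now();
    std::unordered_map<int, int> m =
        initial_buckets ? std::unordered_map<int, int>(initial_buckets)
                        : std::unordered_map<int, int>{};  // 0 means "use the default"
    for (std::size_t i = 0; i < n; ++i)
        m.emplace(static_cast<int>(i), 0);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const std::size_t n = 1000000;
    std::printf("default buckets:   %.2f ms\n", fill_ms(n, 0));
    std::printf("pre-sized buckets: %.2f ms\n",
                fill_ms(n, static_cast<std::size_t>(n / 0.7) + 1));
}
```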
Upvotes: 2