Reputation: 7128
In the sample program given below (source: http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/)
// unordered_map::rehash
#include <iostream>
#include <string>
#include <unordered_map>
int main ()
{
    std::unordered_map<std::string,std::string> mymap;

    mymap.rehash(20);

    mymap["house"] = "maison";
    mymap["apple"] = "pomme";
    mymap["tree"] = "arbre";
    mymap["book"] = "livre";
    mymap["door"] = "porte";
    mymap["grapefruit"] = "pamplemousse";

    std::cout << "current bucket_count: " << mymap.bucket_count() << std::endl;

    return 0;
}
The output is:
current bucket_count: 23
Why does the bucket count become 23?
What's the impact on heap size? When is the heap allocation done: on the bucket rehash or on the actual insert? And when does the dynamic deallocation happen: when clear() is used, when erase() is used, or both?
Upvotes: 2
Views: 7456
Reputation: 61459
Hash tables are usually sized to be "comfortably" larger than the number of items to be stored in the table. This is because the probability of two different items mapping to the same bucket increases as a hash table fills.
As a well-known chart from Wikipedia (image source) shows, for some methods of collision resolution the performance of a hash table degrades dramatically once its "load factor" (the ratio of stored elements to buckets) exceeds a certain threshold.
Therefore, the number of buckets should always be larger than the number of elements in your hash table.
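The container enforces this for you: std::unordered_map rehashes whenever an insertion would push load_factor() (size divided by bucket count) above max_load_factor(), which defaults to 1.0. A minimal sketch, with the caveat that the exact bucket counts it prints are implementation-specific:
#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    for (int i = 1; i <= 1000; ++i)
    {
        m[i] = i;
        // load_factor() == size() / bucket_count(); an insertion that would
        // exceed max_load_factor() triggers an automatic rehash.
        if (i % 250 == 0)
            std::cout << m.size() << " elements, " << m.bucket_count()
                      << " buckets, load factor " << m.load_factor()
                      << " (max " << m.max_load_factor() << ")\n";
    }
}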
Having the number of buckets be a prime number helps ensure that entries are distributed evenly across the table. In general, any hash value that shares a common factor with the number of buckets will land in a bucket whose index is a multiple of that factor. Therefore, if you set the number of buckets to 20 and your hash values happen to be even, you'll waste about 50% of the table's space. The situation is worse if your hash values share larger factors, such as 4, 5, or 10, with the bucket count.
Knowing the above, you can see why the hash table might be larger than you specify: the extra space (usually) contributes to performance. You can also see why the number of buckets is chosen to be prime: it gives better usage of the space you do have. Combining these makes 23 seem like a reasonable choice.
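The common-factor point is easy to verify without an unordered_map at all. Implementations such as libstdc++ reduce a hash value to a bucket index with a modulo, so the sketch below (bucket counts 20 and 23 chosen to match this question) simply counts how many distinct bucket indices purely even hash values can reach:
#include <iostream>
#include <set>

int main()
{
    // Even hash values share the factor 2 with 20, so they can only land in
    // half of a 20-bucket table; with 23 buckets (prime) they reach them all.
    for (unsigned buckets : {20u, 23u})
    {
        std::set<unsigned> used;
        for (unsigned hash = 0; hash < 1000; hash += 2)
            used.insert(hash % buckets);
        std::cout << buckets << " buckets: " << used.size() << " reachable\n";
    }
}
With 20 buckets only 10 indices are reachable; with 23 buckets, all of them are.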
Upvotes: 4
Reputation: 643
This program demonstrates how the bucket count grows with the number of inserted nodes.
#include <iostream>
#include <unordered_map>
#include <random>
#include <cstdint>   // SIZE_MAX is declared here, not in <climits>

using namespace std;

int main()
{
    mt19937_64 mt;
    uniform_int_distribution<size_t> uid( 0, SIZE_MAX );
    unordered_map<size_t, char> map;
    map.max_load_factor( 1.0f );
    for( size_t r = 0; r < 0x10000; ++r )
    {
        size_t bucketsBefore = map.bucket_count();
        map.emplace( uid( mt ), 0 );
        size_t bucketsAfter = map.bucket_count();
        if( bucketsAfter == bucketsBefore )
            continue;
        // print "old size + 1: new bucket count" each time a rehash happens
        cout << map.size() - 1 << " + 1: " << bucketsAfter << endl;
    }
}
These are the results for MSVC:
8 + 1: 64
64 + 1: 512
512 + 1: 1024
1024 + 1: 2048
2048 + 1: 4096
4096 + 1: 8192
8192 + 1: 16384
16384 + 1: 32768
32768 + 1: 65536
These are the results for libstdc++ (g++) / libc++ (clang):
0 + 1: 13
13 + 1: 29
29 + 1: 59
59 + 1: 127
127 + 1: 257
257 + 1: 541
541 + 1: 1109
1109 + 1: 2357
2357 + 1: 5087
5087 + 1: 10273
10273 + 1: 20753
20753 + 1: 42043
42043 + 1: 85229
I don't believe what Brian says. It could be a lot of effort to calculate the next higher prime number above a certain lower bound; I guess libstdc++ and libc++ instead use a fixed table of growth steps that happen to be prime.
For integral types, libstdc++ and libc++ don't calculate a real hash in std::hash but simply return the input value. Maybe choosing a prime bucket count alleviates that questionable decision.
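That identity behaviour is easy to observe with a small sketch; note it is implementation-specific and not something the standard guarantees:
#include <cstddef>
#include <functional>
#include <iostream>

int main()
{
    std::hash<std::size_t> h;
    for (std::size_t v : { std::size_t(1), std::size_t(42), std::size_t(1000000) })
        std::cout << v << " hashes to " << h(v) << '\n';
    // libstdc++ and libc++ currently print each value unchanged; other
    // implementations (e.g. MSVC) print genuinely mixed hash values.
}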
Upvotes: 0
Reputation: 119477
The default rehash policy used by libstdc++ is to go up to the smallest prime number of buckets greater than or equal to the number requested. 23 is the smallest prime above 20.
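If you want to check the policy on your own toolchain, here is a small sketch; only the rehash(20) result of 23 is shown in the question, and the other values depend entirely on the library's rounding policy:
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    for (std::size_t requested : {5, 20, 100, 1000})
    {
        std::unordered_map<std::string, std::string> m;
        m.rehash(requested);  // request at least `requested` buckets
        std::cout << "rehash(" << requested << ") -> "
                  << m.bucket_count() << " buckets\n";
    }
}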
Upvotes: 5