Reputation: 6340
If we know that we're going to hash between m and n items, where m and n are relatively large, what's a reasonable strategy for setting the number of initial buckets for std::unordered_set? If it helps, in my case m=n/2. In general, I'd like to optimize for speed, but can't afford an unreasonable amount of memory. Thanks in advance.
Upvotes: 1
Views: 1373
Reputation: 2703
As an alternative, if you can live with logarithmic performance (usually not a problem), use a std::map instead. Then you have guaranteed lookup complexity 100% of the time, no re-hashing. A very useful property e.g. in hard real-time systems.
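For illustration, a minimal sketch of that alternative; the key and value types are placeholders, and the point is just that lookups go through a balanced tree rather than a hash table:

```cpp
#include <cstdio>
#include <map>

int main() {
    // std::map keeps its elements in a balanced tree: no hash function,
    // no buckets, no rehashing, and every find/insert is O(log n) even
    // in the worst case.
    std::map<int, int> table;            // placeholder key/value types
    for (int i = 0; i < 1000; ++i)
        table[i] = i * i;

    auto it = table.find(42);            // guaranteed logarithmic lookup
    if (it != table.end())
        std::printf("42 -> %d\n", it->second);
}
```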
Upvotes: 0
Reputation: 171263
tl;dr There's no simple answer. Either measure, or let the container manage the bucket size automatically.
As I tried to say in the comments, there are too many variables, and you don't seem to realise how vague you're being. It took an hour for you to even say which implementation you're interested in.
m and n are "relatively large" ... relative to what?
"These are the only two operations and I want them to be fast." Define fast? What's fast enough? What's too slow? Have you measured?
If you want to minimize the load factor, so that there is on average no more than one element per bucket (and so no iteration through a bucket is needed once the right bucket is known), then you'll need at least n buckets. But that doesn't guarantee one bucket per element, because the function used to determine the bucket from a hash code might return the same value for every pointer you put in the container. Whether that's likely depends on the hash function being used, the function that maps hash codes to buckets, and the pointer values themselves.
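As a concrete sketch of what "at least n buckets" means (the value of n below is a placeholder), std::unordered_set will give you that many buckets up front if you ask, either through reserve() or the bucket-count constructor:

```cpp
#include <cstdio>
#include <unordered_set>

int main() {
    const std::size_t n = 1'000'000;   // placeholder for the expected element count

    std::unordered_set<void*> s;
    // With the default max_load_factor() of 1.0, reserve(n) guarantees
    // bucket_count() >= n and avoids rehashing while inserting up to n elements.
    s.reserve(n);

    // Equivalently, request a bucket count in the constructor:
    // std::unordered_set<void*> s2(n);

    std::printf("buckets: %zu, max_load_factor: %f\n",
                s.bucket_count(), s.max_load_factor());
}
```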
For GCC the hash function for pointers is the identity function. For the default unordered_map implementation the mapping to buckets is hash_function(x) % bucket_count(), and the bucket count is always a prime number, to reduce the likelihood of collisions. If the addresses you're storing in the hash map tend to be separated by multiples of the bucket count then they're going to end up in the same bucket. How likely that is depends on the number of buckets used for n (which you haven't stated) and the distribution of pointer values you're using (which you haven't stated).
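If you want to check whether your particular pointer values cluster under that scheme, the container exposes its bucket layout, so a quick diagnostic like the one below (pointers into a vector are just a stand-in for your real data) will show the worst-case bucket occupancy:

```cpp
#include <algorithm>
#include <cstdio>
#include <unordered_set>
#include <vector>

int main() {
    // Stand-in data: addresses of contiguous ints; substitute your real pointers.
    std::vector<int> storage(100000);

    std::unordered_set<const int*> set;
    set.reserve(storage.size());
    for (const int& x : storage)
        set.insert(&x);

    // Count the most crowded bucket. With libstdc++ the bucket is effectively
    // (std::size_t)p % bucket_count(), so regularly spaced addresses can pile
    // up in a few buckets.
    std::size_t worst = 0;
    for (std::size_t b = 0; b < set.bucket_count(); ++b)
        worst = std::max(worst, set.bucket_size(b));

    std::printf("buckets: %zu  load_factor: %f  worst bucket: %zu\n",
                set.bucket_count(), set.load_factor(), worst);
}
```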
If you use a custom hash function that has knowledge of the pointer values you expect to store then you could use a perfect hash function that uniformly distributes between [0, n), set the bucket_count() to n, and ensure no collisions.
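A sketch of how such a hash is plugged in; the IndexHash below is hypothetical and only "perfect" under the assumption that every stored pointer points into one known array, which is exactly the kind of prior knowledge a perfect hash requires:

```cpp
#include <cstdio>
#include <unordered_set>
#include <vector>

// Hypothetical perfect hash: valid only if every key points into the array
// starting at `base`, so each pointer maps to a distinct value in [0, n).
struct IndexHash {
    const int* base;
    std::size_t operator()(const int* p) const {
        return static_cast<std::size_t>(p - base);
    }
};

int main() {
    const std::size_t n = 100000;        // placeholder element count
    std::vector<int> storage(n);

    // Request n buckets up front and supply the custom hash.
    std::unordered_set<const int*, IndexHash> set(n, IndexHash{storage.data()});
    for (const int& x : storage)
        set.insert(&x);

    // With libstdc++'s prime bucket count >= n and hash values in [0, n),
    // hash % bucket_count() leaves each key in its own bucket.
    std::printf("buckets: %zu  load_factor: %f\n",
                set.bucket_count(), set.load_factor());
}
```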
But it's not obvious that ensuring only a single element per bucket is worth it, because it uses more memory. Iterating through a bucket containing two or three elements is not going to be a bottleneck in most programs. Maybe it will be in yours; it's impossible to know, because you haven't said what you want except that it has to be fast, which is so vague it's meaningless.
The only way to answer these questions is for you to measure the real-world performance; nobody can give you a magic number that will make your code faster based on your vague requirements. If there were an easy answer that always made things faster for a "relatively large" number of elements then the standard library implementation would already be doing that, and you'd just be wasting your time doing the same thing by hand.
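As a starting point for that measurement, here is a crude benchmark sketch; the sizes, keys, and workload are placeholders, and a meaningful result needs your real data and access pattern:

```cpp
#include <chrono>
#include <cstdio>
#include <unordered_set>
#include <vector>

int main() {
    const std::size_t n = 1'000'000;       // placeholder problem size
    std::vector<int> storage(n);           // stand-in for the real keys

    // Time a fill-then-lookup workload on the given (empty) set.
    auto run = [&](const char* label, std::unordered_set<const int*> s) {
        auto t0 = std::chrono::steady_clock::now();
        for (const int& x : storage) s.insert(&x);        // rehashes happen here if not reserved
        std::size_t hits = 0;
        for (const int& x : storage) hits += s.count(&x); // look every key up again
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("%-9s %8.2f ms  (hits=%zu, buckets=%zu)\n",
                    label, ms, hits, s.bucket_count());
    };

    std::unordered_set<const int*> grown;      // default bucket count, grows as it fills
    std::unordered_set<const int*> reserved;
    reserved.reserve(n);                       // buckets allocated up front

    run("grown", std::move(grown));
    run("reserved", std::move(reserved));
}
```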
Upvotes: 2