Allanqunzi
Allanqunzi

Reputation: 3260

unordered_map, why some buckets have more than 1 element even hashed values are all different?

I am playing with default hash function for std::unordered_map and checking if there are collisions happening. Here below is the code:

void check(std::map<long, bool> & mp, const string & s, const long v){
    if(mp.count(v) == 0){
        mp[v] = true;
    }else{
        cout<<"collision for "<<s<<endl;
    }
}
int main(){
    std::unordered_map<std::string, long> um;
    std::map<long, bool> map;
    auto hasher = um.hash_function();

    std::string str = "";

long count = 0;

    for (int i = 0; i < 26; ++i)
    {
        str = char(65+i);
        um[str];
        auto value = hasher(str);
        check(map, str, value);

        for (int j = 0; j < 26; ++j)
        {
            str = char(65+i);
            str += char(65+j);
            um[str];
            auto value = hasher(str);
            check(map, str, value);

            for (int k = 0; k < 26; ++k)
            {
                str = char(65+i);
                str += char(65+j);
                str += char(65+k);
                um[str];
                auto value = hasher(str);
                check(map, str, value);

                for (int l = 0; l < 26; ++l)
                {
                    str = char(65+i);
                    str += char(65+j);
                    str += char(65+k);
                    str += char(65+l);
                    um[str];
                    auto value = hasher(str);
                    check(map, str, value);

                    for (int m = 0; m < 26; ++m)
                    {
                        str = char(65+i);
                        str += char(65+j);
                        str += char(65+k);
                        str += char(65+l);
                        str += char(65+m);
                        um[str];
                        auto value = hasher(str);
                        check(map, str, value);
                    }
                }
            }
        }
    }
        for(int i = 0; i < um.bucket_count(); ++i){
        if(um.bucket_size(i) >= 2)cout<<"um_collison happened at "<<i<<"  "<<um.bucket_size(i)<<endl;
    }

    return 0;
}

The problem I have is I found that, there is no output for collision for, but there are plenty of outputs like um_collison happened at 17961055 3, I am using g++ 4.9.2, why the strings are hashed to different values but some buckets still have more than 1 element?

Upvotes: 0

Views: 524

Answers (1)

Chris Beck
Chris Beck

Reputation: 16224

The value returned by the hasher function is a size_t. The hasher function is assumed to give at least weakly pseudorandom (i.e. "pairwise independent" or similar) output when given a random input. However, the output of the hasher function is not used directly as the bin number of a given input. That would require that std::unordered_map has 2^64 bins on a 64 bit machine... which is way too much memory.

Instead, based on the current number of elements, there are a certain number of bins, say B. The map will take the output of the hasher, modulo B, usually, to map the items to bins. Thus items with different hashes can go to the same bin. Once the hash table gets too tightly packed, depending on some heuristics, usually B will be increased and the items will get rehashed to reduce the number of collisions. That is, we recompute all the hashes and then take them modulo B' instead for some larger B'. But how exactly it works is an implementation detail.

Upvotes: 3

Related Questions