juan garcia
juan garcia

Reputation: 1526

Hash string function djb2 avoiding collisions properly?

I am studying about hash string functions for a hasmap in javascript. And looking into this code that I found in the web, I am not sure if the function is correct or not:

HashMap._hashString = function(string) {
    var hash = 5381;
    for (var i=0; i<string.length; i++) {
        hash = (hash << 5) + hash + string.charCodeAt(i);
        hash = hash & hash;
    }
    //Reduce the chance of collisions by incorporating the string length,
    //and randomize the hashes to prevent malicious collisions.
    return hash ^ string.length ^ this._secret;
};

Does it make any sense to have this line?

        hash = hash & hash;

In this line of code:

    return hash ^ string.length ^ this._secret;

I understand that adding the length of the string as a factor for the hash to evaluate would help to work with the collisions, but why would I add this factor with a XOR operation? Why not using any other bit operator?

I am also reading about this article, to understand a bit more about the hash algorithms:

http://www.cse.yorku.ca/~oz/hash.html

Upvotes: 0

Views: 1286

Answers (1)

trincot
trincot

Reputation: 350831

Does it make any sense to have this line?

   hash = hash & hash;

The purpose of that line is to limit the value to the 32-bit range. hash & hash looks like a no-op, but applying bitwise operators will clip any overflow. It gives the same result as this:

 hash = hash & 0xFFFFFFFF

In this line of code:

return hash ^ string.length ^ this._secret;

I understand that adding the length of the string as a factor for the hash to evaluate would help to work with the collisions, but why would I add this factor with a XOR operation? Why not using any other bit operator?

With & or | you would lose information: different inputs of the same length would have a bit higher chance of collision. In particular, an & with a length that is a power of 2, would be disastrous, as it could only yield 2 different values (the length itself or zero). Or an | with a length that has mostly 1 bits (like 0xffff): this would again limit the possible outcomes.

Doing a + would be a viable alternative, but then you would want to make sure the result remains in the 32-bit range again.

Upvotes: 1

Related Questions