devdropper87

Reputation: 4197

Understanding of hash tables and collision detection

Below is my implementation of a hash table using "buckets" for collision detection. I'm trying to make sure I can fully understand the logic behind hash tables and visualize it. This is what a hash table looks like in my case:

[[[]<----tuple]<---bucket, []<----bucket]<--storage

The key/value pairs (tuples) are placed in a bucket based on the hashing function's output, i.e., the bucket index. Once you are at the hashed bucket index, you place the key/value pair there, inside the bucket. If anything inside the bucket matches that exact key, its value gets overwritten in my implementation. The collision handling takes place when you find the same key as before: you overwrite its value.

Could I have done something different—perhaps added the key with a different value to the end of the tuple (instead of overwriting the value) or must the keys always be unique? Is there ever a case when only the values need to be unique?

    var makeHashTable = function() {
      var max = 4;

      return {
        _storage: [],

        retrieve: function(key) {
          // Retrieval accesses the bucket the same way insert does, but doesn't
          // need to write anything back to storage, since insert already takes
          // care of that. If the bucket was never created, fall back to an empty
          // array so the loop simply doesn't run and we return null by default.
          var bucketIndex = hashFn(key, max);
          var bucket = this._storage[bucketIndex] || [];

          for (var i = 0; i < bucket.length; i++) {
            var tuple = bucket[i];
            if (tuple[0] === key) {
              return tuple[1];
            }
          }
          return null;
        },

        insert: function(key, value) {
          // The hash function gives you the right bucket index.
          var bucketIndex = hashFn(key, max);
          // If there's no bucket at that index yet, initialize one,
          // then store it back so the bucket actually lives in storage.
          var bucket = this._storage[bucketIndex] || [];
          this._storage[bucketIndex] = bucket;
          // Collision-handling scheme: if the key already exists in this
          // bucket, overwrite its value; otherwise append the tuple at the
          // end. The loop simply doesn't run on a freshly created bucket,
          // so a new key/value pair falls through to the push below.
          for (var i = 0; i < bucket.length; i++) {
            var tuple = bucket[i];
            if (tuple[0] === key) {
              tuple[1] = value;
              return;
            }
          }
          bucket.push([key, value]);
        }
      };
    };


    // (Note: for this to work as written, remove would need to live inside the
    // object returned by makeHashTable, where `max` and `_storage` are in scope.)
    HashTable.prototype.remove = function(key) {
      var bucketIndex = hashFn(key, max);
      var bucket = this._storage[bucketIndex] || [];

      for (var i = 0; i < bucket.length; i++) {
        var tuple = bucket[i];
        if (tuple[0] === key) {
          bucket.splice(i, 1);
          return;
        }
      }
    };

    // don't worry about this generic hashing function please, not the point of my question
    var hashFn = function(str, max) {
      var hash = 0;
      for (var i = 0; i < str.length; i++) {
        var letter = str[i];
        hash = (hash << 5) + letter.charCodeAt(0);
        hash = (hash & hash) % max;
      }
      return hash;
    };

Upvotes: 2

Views: 4633

Answers (2)

Michael Laszlo

Reputation: 12239

Your implementation of a hash table is correct. I should point out that what you've described in your question isn't collision detection but the operation of updating a key with a new value. A collision is when two different keys map to the same bucket, not when you insert a key and discover that there is a prior entry with the same key. You're already taking care of collisions by chaining entries in the same bucket.

In any case, you've gone about updating entries correctly. Let's say you've inserted the (key, value) pair ('a', 'ant') into the hash table. This means that 'a' maps to 'ant'. If you insert ('a', 'aardvark'), the intention is to overwrite the 'a' entry so that it now maps to 'aardvark'. Therefore, you iterate over the chain of entries and check for the key 'a' in the bucket. You find it, so you replace the value 'ant' with 'aardvark'. Now 'a' maps to 'aardvark'. Good.

Let's suppose you don't iterate over the chain of entries. What happens if you blindly append ('a', 'aardvark') to the end of the chain? The consequence is that when you look up the key 'a' and you go through the entries in the bucket, you come upon ('a', 'ant') first, so you return 'ant'. This is an incorrect result. You recently inserted ('a', 'aardvark'), so you should have returned 'aardvark'.

Ah, but what if you always start searching through the chain from the end? In other words, you're treating it as a stack. To insert an entry, you push it onto the end of the chain. To look up a key, you start searching from the end. The first entry with the given key is the one that was most recently inserted, so it's the correct one and you can return the value without searching further.
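Here is a minimal sketch of that stack-like variant (the factory name and the tiny hash function are mine, purely for illustration): insertion blindly appends, and lookup scans from the end so the newest entry for a key wins.

```javascript
// Hypothetical "append and search from the end" hash table.
var makeStackTable = function() {
  var max = 4;
  var storage = [];
  var hash = function(key) {
    var h = 0;
    for (var i = 0; i < key.length; i++) {
      h = (h + key.charCodeAt(i)) % max;
    }
    return h;
  };
  return {
    insert: function(key, value) {
      var bucket = storage[hash(key)] || (storage[hash(key)] = []);
      bucket.push([key, value]); // always append, never overwrite
    },
    retrieve: function(key) {
      var bucket = storage[hash(key)] || [];
      // search from the end: the most recently inserted entry wins
      for (var i = bucket.length - 1; i >= 0; i--) {
        if (bucket[i][0] === key) return bucket[i][1];
      }
      return null;
    }
  };
};

var t = makeStackTable();
t.insert('a', 'ant');
t.insert('a', 'aardvark');
// retrieve('a') now finds the newer entry first
```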

That implementation would be correct, but it would also make the chain longer than necessary. Consider what happens if you're using the hash table to count letter frequencies in a text file. Initially you insert ('a', 0) in the table. When you find the first occurrence of 'a' in the text, you read 0 from the table, add 1 to it, and insert ('a', 1) into the hash table. Now you have two entries in the chain with the key 'a', and only the one nearer to the end is valid. When you find the next occurrence of 'a', a third entry is added to the chain, and so on. Thousands of insertions with the same key result in thousands of entries in the chain.
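To make the chain growth concrete, here is a toy version of that letter-frequency scenario (the helper names are mine); a single bucket stands in for the chain that 'a' hashes to, and every increment appends a brand-new entry instead of replacing the old one.

```javascript
// One chain, as if 'a' always hashes to this bucket.
var bucket = [];

var blindInsert = function(key, value) {
  bucket.push([key, value]); // never overwrite
};

var lookupFromEnd = function(key) {
  // newest entry for the key wins
  for (var i = bucket.length - 1; i >= 0; i--) {
    if (bucket[i][0] === key) return bucket[i][1];
  }
  return null;
};

blindInsert('a', 0);          // initial count
var text = 'aaaa';
for (var i = 0; i < text.length; i++) {
  // each increment appends another entry instead of updating in place
  blindInsert('a', lookupFromEnd('a') + 1);
}
// The count is correct, but the chain now holds five entries for one key.
```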

Not only does this use up memory, it slows down other key insertions. For example, suppose your hash function assigns the same index to the keys 'a' and 'q'. This means that the 'q' entries are in the same bucket as the 'a' entries. If you have a whole bunch of 'a' entries at the end of the chain, you might have to go past many of them before you find the most recent entry with 'q'. For these reasons, it's better to do what you did.

One more thought: what if each entry is a tuple (key, values), where values is an array of values? Then, as you suggest, you can append a new value to the end of values in the event of a key collision. But if you do that, what is the meaning of values? It contains the values that were inserted with that key, in order of the time they were inserted. If you treat it as a stack and always return the last value in the list, you're wasting space. You may as well overwrite a single value.
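A quick sketch of that (key, values) shape, again with illustrative names and a single bucket standing in for the chain:

```javascript
// Each entry is [key, values], where values collects every insert for that key.
var bucket = [];

var multiInsert = function(key, value) {
  for (var i = 0; i < bucket.length; i++) {
    if (bucket[i][0] === key) {
      bucket[i][1].push(value); // append to this key's value list
      return;
    }
  }
  bucket.push([key, [value]]); // first value for this key
};

multiInsert('a', 'ant');
multiInsert('a', 'aardvark');
// bucket is now [['a', ['ant', 'aardvark']]]; if only the last value is ever
// read, everything before it is wasted space.
```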

Is there ever a case when you can get away with putting a new value into the bucket and not checking for an existing key? Yes, you can do that if you have a perfect hash function, which guarantees that there are no collisions. Each key gets mapped to a different bucket. Now you don't need a chain of entries. You have a maximum of one value in each bucket, so you can implement the hash table as an array that contains, at each index, either undefined or the most recently inserted value at that index. This sounds great, except it isn't easy to come up with a perfect hash function, especially if you want your hash table to contain no more buckets than necessary. You would have to know in advance every possible key that might be used in order to devise a hash function that maps the n possible keys to n distinct buckets.
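For a toy illustration, suppose you know in advance that the only possible keys are the single letters 'a' through 'd' (an assumption made purely so a perfect hash is trivial to write). Then each key gets its own slot and no chains or tuples are needed:

```javascript
// A perfect hash for the assumed key set {'a', 'b', 'c', 'd'}:
// each key maps to a distinct index, so collisions are impossible.
var perfectHash = function(key) {
  return key.charCodeAt(0) - 'a'.charCodeAt(0); // 'a' -> 0, ..., 'd' -> 3
};

var slots = new Array(4); // at most one value per bucket

var insert = function(key, value) {
  slots[perfectHash(key)] = value; // no chain to search, just overwrite
};

var retrieve = function(key) {
  var v = slots[perfectHash(key)];
  return v === undefined ? null : v;
};

insert('a', 'ant');
insert('a', 'aardvark'); // simply overwrites slot 0
```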

Upvotes: 1

William Oliver

Reputation: 1432

Collisions in hash tables are generally handled by having each bucket be an array (or whatever data structure best serves your needs). This way, when two keys hash to the same index, you can just push the new entry into the array at that index, and later you only have to search the elements in that array. This is usually not a problem, because it is still much, much better than searching the entire hash table.

If there is only one element in the array, it still takes constant time to find the element.
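A bare-bones sketch of that chaining (the names here are illustrative, and the tiny hash function is chosen so that 'a' and 'e' deliberately collide):

```javascript
var max = 4;
var storage = [];

// trivial hash: sum of character codes, mod the table size
var hashFn2 = function(str) {
  var h = 0;
  for (var i = 0; i < str.length; i++) {
    h = (h + str.charCodeAt(i)) % max;
  }
  return h;
};

var put = function(key, value) {
  var b = storage[hashFn2(key)] || (storage[hashFn2(key)] = []);
  b.push([key, value]); // colliding entries share the bucket array
};

var get = function(key) {
  var b = storage[hashFn2(key)] || [];
  // only this one bucket is scanned, not the whole table
  for (var i = 0; i < b.length; i++) {
    if (b[i][0] === key) return b[i][1];
  }
  return null;
};

put('a', 'ant');
put('e', 'elk'); // 'a' (97 % 4) and 'e' (101 % 4) both land in bucket 1
```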

Upvotes: 0
