fibono
fibono

Reputation: 783

How does a hash (in a language like Ruby) work "under the hood"?

I've read here and there about hash maps/tables, and can kind of understand the concept that a hash table is essentially a finite-sized array. The function could use the modulus operator to determine which index in the array corresponds to a particular key. If collisions occur, then a linked-list can be implemented to store all the collided values. This is my very-novice understanding, and I hope someone can expound on it/correct it in the context of a Ruby hash. In Ruby, all you really have to do is

hash = {}    
hash[key] = value

and this creates a key with the corresponding value. Say that you're just storing a bunch of symbols as keys and numbers as values:

hash[:a] = 1
hash[:b] = 2
...

What exactly is happening under the hood in terms of storing the values in arrays and linked-lists? What would be an example of a collision?

Upvotes: 1

Views: 729

Answers (3)

Dariya L.
Dariya L.

Reputation: 31

In ruby 2.4, Hash table was moved to open addressing model, so I will describe only how Hash-tables structure works, but not how it is implemented in 2.4 and above.

Let's imagine that we store all entries in an array. When we want to find something, we have to go through all the elements to match one. This can take a long time if we have a lot of elements and using a hash table lets us go directly to the cell with the required value by computing the hash function for that key.

The hash table stores all values in the store (bins) groups, in a data structure similar to an array.

How does hash table work

When we add a new key-value pair, we need to calculate to which "storage" this pair will be inserted and we do this using the .hash method (hash function). The resulting value from the hash function is a pseudo-random number as it always produces the same number for the same value.

Roughly speaking, hash returns the equivalent of the link to the memory location where the current object is stored. However, for strings, the calculation is relative to the value.

Having received a pseudo-random number, we have to calculate the number of the "storage" where the key-value pair will be stored.

'a'.hash % 16 =>9 a - key 16 - amount of storage 9 - the storage number

So, in Ruby the insertion works in the following way:

How insertion works

  1. It takes the hash of the key using the internal hash function. :c.hash #=> 2782

  2. After getting the hash value, with the help of modulo operation (2782 % 16) we will get the storage number where to keep our key-value pair :d.hash % 16

  3. Add key-value to a linked list of the proper bin

The search works as follows:

The search works quite the same way:

  1. Determine "hash" function;
  2. Find "storage";
  3. Then iterate through the list and retrieve a hash element.

In ruby, the average number of elements per bin is 5. With the increase in the number of records, the density of elements will grow in each repository (in fact, that size of hash-table is only 16 storages).

If the density of the elements is large, for example 10_000 elements in one "storage", we will have to go through all the elements of this linked-list to find the corresponding record. And we'll go back to O(n) time, which is pretty bad.

To avoid this, table rehash is applied. This means that hash-table size will be increased (up to the next number of - 16, 32, 64, 128, ...) and for all current elements the position in the "storages" will be recalculated.

"Rehash" occurs when the number of all elements is greater than the maximum density multiplied by the current table size.

81 > 5 * 16 - rehash will be called when we add 81 elements to the table.

num_entries > ST_DEFAULT_MAX_DENSITY * table->num_bins

When the number of entries reaches the maximum possible value for current hash-table, the number of "storages" in that hash-table increases (it takes next size-number from 16, 32, 64, 128), and it re-calculates and corrects positions for all entries in that hash.

Check this article for a more in-depth explanation: Do You Know How Hash Table Works? (Ruby Examples)

Upvotes: 1

Jörg W Mittag
Jörg W Mittag

Reputation: 369624

The Ruby Language Specification does not prescribe any particular implementation strategy for the Hash class. Every implementation is allowed to implement it however they want, provided they honor the contract.

For example, here is Rubinius's implementation, which, being written in Ruby, is pretty easy to follow: kernel/common/hash.rb This is a fairly traditional hashtable. (One other cool thing to note about this implementation is that it actually happens to be as fast as YARV's, which proves that Ruby code can be as efficient as hand-optimized C.)

Rubinius also alternatively implements the Hash class with a Hash Array Mapped Trie: kernel/common/hash_hamt.rb [Note: this implementation uses three VM primitives written in C++.]

You can switch between those two implementations using a configuration option. So, not only is the Hash implementation different between different Ruby implementations, it might even be different between two runs of the exact same program on the exact same version of the exact same Ruby implementation!

In IronRuby, Ruby's Hash class simply delegates to a .NET System.Collections.Generic.Dictionary<object, object>: Ruby/Builtins/Hash.cs In previous versions, it didn't even delegate, it was just simply a subclass: Ruby/Builtins/Hash.cs

Upvotes: 3

Mircea
Mircea

Reputation: 10566

If you are hardcore about this you could look at the implementation directly. This is what the hash ends up using: https://github.com/ruby/ruby/blob/c8b3f1b470e343e7408ab5883f046b1056d94ccc/st.c

The hash itself is here: https://github.com/ruby/ruby/blob/trunk/hash.c

Most of the times, the article diego provided in comments will be more than enough

Upvotes: 2

Related Questions