Reputation: 783
I've read here and there about hash maps/tables, and can kind of understand the concept that a hash table is essentially a finite-sized array. The function could use the modulus operator to determine which index in the array corresponds to a particular key. If collisions occur, then a linked-list can be implemented to store all the collided values. This is my very-novice understanding, and I hope someone can expound on it/correct it in the context of a Ruby hash. In Ruby, all you really have to do is
hash = {}
hash[key] = value
and this creates a key with the corresponding value. Say that you're just storing a bunch of symbols as keys and numbers as values:
hash[:a] = 1
hash[:b] = 2
...
What exactly is happening under the hood in terms of storing the values in arrays and linked-lists? What would be an example of a collision?
Upvotes: 1
Views: 729
Reputation: 31
In ruby 2.4, Hash table was moved to open addressing model, so I will describe only how Hash-tables structure works, but not how it is implemented in 2.4 and above.
Let's imagine that we store all entries in an array. When we want to find something, we have to go through all the elements to match one. This can take a long time if we have a lot of elements and using a hash table lets us go directly to the cell with the required value by computing the hash function for that key.
The hash table stores all values in the store (bins) groups, in a data structure similar to an array.
When we add a new key-value pair, we need to calculate to which "storage" this pair will be inserted and we do this using the .hash
method (hash function). The resulting value from the hash function is a pseudo-random number as it always produces the same number for the same value.
Roughly speaking, hash
returns the equivalent of the link to the memory location where
the current object is stored. However, for strings, the calculation is relative to the value.
Having received a pseudo-random number, we have to calculate the number of the "storage" where the key-value pair will be stored.
'a'.hash % 16 =>9
a - key
16 - amount of storage
9 - the storage number
So, in Ruby the insertion works in the following way:
It takes the hash of the key using the internal hash function.
:c.hash #=> 2782
After getting the hash value, with the help of modulo operation (2782 % 16) we will get the storage number where to keep our key-value pair :d.hash % 16
Add key-value to a linked list of the proper bin
The search works as follows:
The search works quite the same way:
In ruby, the average number of elements per bin is 5. With the increase in the number of records, the density of elements will grow in each repository (in fact, that size of hash-table is only 16 storages).
If the density of the elements is large, for example 10_000 elements in one "storage", we will have to go through all the elements of this linked-list to find the corresponding record. And we'll go back to O(n) time, which is pretty bad.
To avoid this, table rehash is applied. This means that hash-table size will be increased (up to the next number of - 16, 32, 64, 128, ...) and for all current elements the position in the "storages" will be recalculated.
"Rehash" occurs when the number of all elements is greater than the maximum density multiplied by the current table size.
81 > 5 * 16 - rehash will be called when we add 81 elements to the table.
num_entries > ST_DEFAULT_MAX_DENSITY * table->num_bins
When the number of entries reaches the maximum possible value for current hash-table, the number of "storages" in that hash-table increases (it takes next size-number from 16, 32, 64, 128), and it re-calculates and corrects positions for all entries in that hash.
Check this article for a more in-depth explanation: Do You Know How Hash Table Works? (Ruby Examples)
Upvotes: 1
Reputation: 369624
The Ruby Language Specification does not prescribe any particular implementation strategy for the Hash
class. Every implementation is allowed to implement it however they want, provided they honor the contract.
For example, here is Rubinius's implementation, which, being written in Ruby, is pretty easy to follow: kernel/common/hash.rb
This is a fairly traditional hashtable. (One other cool thing to note about this implementation is that it actually happens to be as fast as YARV's, which proves that Ruby code can be as efficient as hand-optimized C.)
Rubinius also alternatively implements the Hash
class with a Hash Array Mapped Trie: kernel/common/hash_hamt.rb
[Note: this implementation uses three VM primitives written in C++.]
You can switch between those two implementations using a configuration option. So, not only is the Hash
implementation different between different Ruby implementations, it might even be different between two runs of the exact same program on the exact same version of the exact same Ruby implementation!
In IronRuby, Ruby's Hash
class simply delegates to a .NET System.Collections.Generic.Dictionary<object, object>
: Ruby/Builtins/Hash.cs
In previous versions, it didn't even delegate, it was just simply a subclass: Ruby/Builtins/Hash.cs
Upvotes: 3
Reputation: 10566
If you are hardcore about this you could look at the implementation directly. This is what the hash ends up using: https://github.com/ruby/ruby/blob/c8b3f1b470e343e7408ab5883f046b1056d94ccc/st.c
The hash itself is here: https://github.com/ruby/ruby/blob/trunk/hash.c
Most of the times, the article diego provided in comments will be more than enough
Upvotes: 2