ApathyBear

Reputation: 9615

Hashtable size said to be arbitrary, why?

I am learning about abstract data types. Lately I have been reading about hashing with a map (or a dict-like data structure).

Here is what the code looks like:

class HashTable:
    def __init__(self):
        self.size = 11                       # number of slots (a prime)
        self.slots = [None] * self.size      # keys
        self.data = [None] * self.size       # values, parallel to slots

    def put(self, key, data):
        hashvalue = self.hashfunction(key, len(self.slots))

        if self.slots[hashvalue] is None:
            self.slots[hashvalue] = key
            self.data[hashvalue] = data
        else:
            if self.slots[hashvalue] == key:
                self.data[hashvalue] = data  # replace
            else:
                # collision: probe until we find an empty slot or the key itself
                nextslot = self.rehash(hashvalue, len(self.slots))
                while self.slots[nextslot] is not None and \
                        self.slots[nextslot] != key:
                    nextslot = self.rehash(nextslot, len(self.slots))

                if self.slots[nextslot] is None:
                    self.slots[nextslot] = key
                    self.data[nextslot] = data
                else:
                    self.data[nextslot] = data  # replace

    def hashfunction(self, key, size):
        return key % size

    def rehash(self, oldhash, size):
        return (oldhash + 1) % size          # linear probing: try the next slot

    def get(self, key):
        startslot = self.hashfunction(key, len(self.slots))

        data = None
        stop = False
        found = False
        position = startslot
        while self.slots[position] is not None and \
                not found and not stop:
            if self.slots[position] == key:
                found = True
                data = self.data[position]
            else:
                position = self.rehash(position, len(self.slots))
                if position == startslot:    # probed all the way around
                    stop = True
        return data

    def __getitem__(self, key):
        return self.get(key)

    def __setitem__(self, key, data):
        self.put(key, data)

Now within the textbook, the author states that the size of the hashtable is arbitrary. See here:

Note that the initial size for the hash table has been chosen to be 11. Although this is arbitrary, it is important that the size be a prime number so that the collision resolution algorithm can be as efficient as possible.

Why is this arbitrary? It would seem that the number of slots given is directly correlated with how many values can be stored. I know that other hash tables may be flexible and able to store more than one value per slot, but in THIS specific example it isn't just 'arbitrary': it is exactly how many values can be stored.
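
For example (my own quick check, assuming the class above is used as-is): after eleven distinct keys every slot is occupied, so there is nowhere left to put a twelfth.

h = HashTable()
for key in range(11):        # keys 0..10 each hash to their own slot
    h[key] = str(key)

print(h.slots)                                      # no None left anywhere
print(all(slot is not None for slot in h.slots))    # True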

Am I missing something here?

Upvotes: 0

Views: 1630

Answers (3)

abarnert

Reputation: 366103

The Paramagnetic Croiss answered your main question. The number 11 does of course mean that you can't fit more than 11 elements without reallocating your table and rehashing all your elements, so obviously it's not arbitrary in that sense. But it's arbitrary in the sense that as long as the number is prime (and, yes, larger than the number of inserts you're going to do), everything the author intends to demonstrate will work out the same.*

* In particular, if your elements are natural numbers, and your table size is prime, and small enough compared to the largest integer, % size makes a great hash function.
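
As a toy illustration of that footnote (my own, not from the answer): keys that share a common stride collide badly when the table size shares a factor with that stride, while a prime size can spread them across every bucket.

keys = [k * 6 for k in range(1, 12)]      # 6, 12, 18, ..., 66

for size in (10, 11):                     # composite size vs. prime size
    print(size, sorted(set(k % size for k in keys)))
# size 10 only ever uses buckets {0, 2, 4, 6, 8}; size 11 reaches all 11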

But for your followup question:

It would seem though, that making a table with a bigger prime number would allow for you to have more available slots and require less need for you to rehash, and have less items to search through in each slot (if you extended the data slots to hold more than one value). The items would be spread thinner in general. Is this not correct?

If I understand you right, you're not using the right words, which is why you're getting confusing answers. Your example code uses a function called rehash, but that's misleading. Rehashing is one way to do probing, but it's not the way you're doing it; you're just doing a combination of linear probing and double hashing.* More commonly, when people talk about rehashing, they're talking about what you do after you grow the hash table and have to rehash every value from the old table into the new one.

* When your hash function is as simple as key%size, the distinction is ambiguous…
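
A rough sketch of that grow-and-rehash step (my own addition, not from the answer or the textbook; the resize method and the choice of new prime are hypothetical), written as an extra method on the HashTable class above:

    def resize(self, newsize):
        # Hypothetical helper: newsize should be a prime larger than self.size.
        oldslots, olddata = self.slots, self.data
        self.size = newsize
        self.slots = [None] * newsize
        self.data = [None] * newsize
        for key, data in zip(oldslots, olddata):
            if key is not None:
                self.put(key, data)   # every existing key is hashed into the new table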

Anyway, yes, more load (if you have N elements in M buckets, you have N/M load) means more probing, which is bad. To take the most extreme example, at load 1.0, the average operation will have to probe through half the table to find the right bucket, making the hash table as inefficient as brute-force searching an array.

However, as you decrease load, the returns drop off pretty fast. You can draw the exact curve for any particular hash implementation, but the rule of thumb you usually use (for closed hashes like this) is that getting the load below 2/3 is usually not worth it. And keep in mind that a larger hash table has costs as well as benefits. Let's say you're on a 32-bit machine with a 64-byte cache line. So, 11 pointers fit in a single cache line; after any hash operation, the next one is guaranteed to be a cache hit. But 17 pointers are split across two cache lines; after any hash operation, the next one only has a 50% chance of being a cache hit.*

* Of course realistically there's plenty of room inside your loop to use up 2 cache lines for a hash table; that's why people don't usually worry about performance at all when N is in single digits… But you can see how with larger hash tables, keeping too much empty space can mean more L1 cache misses, more L2 cache misses, in the worst case even more VM page misses.
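
If you want to see that load/probing trade-off for yourself, here is a small experiment of my own (a toy open-addressed table with linear probing, not the book's class): average probes per insert stay small up to roughly 2/3 load and then climb quickly.

import random

def average_probes(n_items, table_size):
    slots = [None] * table_size
    probes = 0
    for key in random.sample(range(10**6), n_items):   # distinct random keys
        pos = key % table_size
        while slots[pos] is not None:                   # linear probing
            pos = (pos + 1) % table_size
            probes += 1
        slots[pos] = key
    return probes / n_items

random.seed(0)
size = 1009                                             # a prime, per the book's advice
for load in (0.25, 0.5, 0.67, 0.9):
    print(load, round(average_probes(int(size * load), size), 2))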

Upvotes: 1

Why is this arbitrary?

Because he could have chosen any other small prime.

It would seem that the number of slots is directly correlated with […] how many values can be stored

Yep, and that's irrelevant. If you need to grow your hash table, you resize (reallocate) and re-hash it. This is not what the author is talking about.
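
To make "any other small prime" concrete, here is a hypothetical variation of my own that only changes the size; everything else in the example behaves the same (this assumes the HashTable class from the question is in scope):

class HashTable13(HashTable):
    def __init__(self):
        self.size = 13                    # a different small prime
        self.slots = [None] * self.size
        self.data = [None] * self.size

h = HashTable13()
h[54] = "cat"
h[26] = "dog"
h[93] = "lion"
print(h[54], h[26], h[93])                # cat dog lion, exactly as with size 11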

Upvotes: 2

Drakosha

Reputation: 12165

Well, nobody can predict the future: you never know how many values the user of the data structure will actually put in the container. So you start with something small, so as not to eat too much memory, and then grow and rehash as needed.

Upvotes: 0
