Некто
Некто

Reputation: 1820

Implementation of hashtable/dictionary in python

I've been studying data structures in python and I created a simple implementation of a dictionary which is below. While this implementation is in the end useless (could just use hash() instead of creating a hash function etc.), I'm interested in the nitty gritty of how it all fits together.

This implementation chooses a starting size of 11. self.capacity keeps track of the number of free slots remaining. When a (key, value) pair is added, it decreases by one, and once it hits 0, it triggers a new slot to be created every time one is needed.

My question is this: the hashvalues that are being computed from the hash function depend on len(self.slots), but this value is constantly increasing when I add more space to my dictionary. Instead of using len(self.slots) to calculate the hashfunction, I tried just using the initial size (11), but as soon as the dictionary attempts to add the 12th (key,value) pair, the program seems to get stuck. This seems to suggest that the hash function needs to be based on the size of the table, and that in order to keep adding elements I need to be able to increase the size of the table. This leads me to the following questions.

  1. Do dictionaries have to be initialized to a fixed length and remain that length? If not, what is the preferred way of incrementing the length as elements are added?
  2. If the length of the dictionary can change, should the hash function be constructed without using the dictionary size? How could I ensure that values would still reduce to those in the range of table slots without reducing via modulo table size?

Any explanations, interesting insights or useful tidbits would be greatly appreciated.

#

class HashTable:
    def __init__(self):
        self.size = 11
        self.capacity = self.size
        self.slots = [None] * self.size
        self.data =  [None] * self.size

    def hashfunction(self, key, size):
        return key%size

    def rehash(self, oldhash, size):
        return (oldhash+1)%size

    def put(self, key, value):
        hashvalue = self.hashfunction(key,len(self.slots))

        if self.capacity < 1:
            self.slots += [None]
            self.data += [None]
            self.capacity += 1

        if self.slots[hashvalue] == None:
            self.slots[hashvalue] = key
            self.data[hashvalue] = value
            self.capacity -= 1
        else:
            if self.slots[hashvalue] == key:
                self.data[hashvalue] = data
            else:
                rehashed = self.rehash(hashvalue, len(self.slots))
                while self.slots[rehashed] != None and self.slots[rehashed] != key:
                    rehashed = self.rehash(rehashed, len(self.slots))
                if self.slots[rehashed] == None:
                    self.slots[rehashed] = key
                    self.data[rehashed] = value
                    self.capacity -= 1
                else:
                    self.data[rehashed] = value

    def get(self, key):
        startslot = self.hashfunction(key, len(self.slots))
        data = None
        found = False
        stop = False
        position = startslot
        while self.slots[position] != None and not found and not stop:
            if self.slots[position] == key:
                data = self.data[key]
                found = True
            else:
                position = self.rehash(position, len(self.slots))
                if position == startslot:
                    stop = True
        return data

    def __delitem__(self, key):
        hashvalue = self.hashfunction(key, len(self.slots))

        if self.slots[hashvalue] == key:
            self.slots[hashvalue] = None
            self.data[hashvalue] = None
        else:
            rehashed = self.hashfunction(hashvalue, len(self.slots))
            while self.slots[rehashed] != key:
                rehashed = self.hashfunction(rehashed, len(self.slots))
            if self.slots[rehashed] == key:
                self.slots[rehashed] == None
                self.data[rehashed] == None

    def __contains__(self, key):
        return key in self.slots

    def __getitem__(self, key):
        return self.get(key)

    def __setitem__(self, key, value):
        self.put(key, value)

Upvotes: 3

Views: 2596

Answers (1)

Martijn Pieters
Martijn Pieters

Reputation: 1121158

You need to keep your hash and your table size separate. The hash should be based on the key alone, not the size. Store the following information per key-value entry:

  • The key
  • The hash - a constant value derived from the key. Make sure it has breath to avoid too many collisions.
  • The value

You pick a slot based on the table size and the hash:

slot = hash % tablesize

Then, when you run out of space in your current table, produce a new table (say, double the size), to accommodate your growing data set, and reslot everything. You already have the hashes cached, all you have to do is take each (key, hash, value) tuple and re-calculate the new slot with the above formula, now with a larger tablesize.

You'll also have to decide how to handle collisions; two hashes that given the current table size, end up in the same slot. Python's dict uses open addressing, where the hash is 'perturbed' in a reproducible fashion until another empty slot is found.

You may want to study the Python dict source code to see how they do this, see the extensive comments on how collisions are handled. You could also watch this PyCon presentation where Brandon Rhodes explains all those details with some very enlightening graphics. Or you could pick up a copy of Beautiful Code which has a whole chapter on the Python dict implementation.

Upvotes: 8

Related Questions