Reputation: 1820
I've been studying data structures in python and I created a simple implementation of a dictionary which is below. While this implementation is in the end useless (could just use hash()
instead of creating a hash function etc.), I'm interested in the nitty gritty of how it all fits together.
This implementation chooses a starting size of 11. self.capacity
keeps track of the number of free slots remaining. When a (key, value) pair is added, it decreases by one, and once it hits 0, it triggers a new slot to be created every time one is needed.
My question is this: the hashvalues that are being computed from the hash function depend on len(self.slots)
, but this value is constantly increasing when I add more space to my dictionary. Instead of using len(self.slots)
to calculate the hashfunction, I tried just using the initial size (11), but as soon as the dictionary attempts to add the 12th (key,value) pair, the program seems to get stuck. This seems to suggest that the hash function needs to be based on the size of the table, and that in order to keep adding elements I need to be able to increase the size of the table. This leads me to the following questions.
Any explanations, interesting insights or useful tidbits would be greatly appreciated.
#
class HashTable:
def __init__(self):
self.size = 11
self.capacity = self.size
self.slots = [None] * self.size
self.data = [None] * self.size
def hashfunction(self, key, size):
return key%size
def rehash(self, oldhash, size):
return (oldhash+1)%size
def put(self, key, value):
hashvalue = self.hashfunction(key,len(self.slots))
if self.capacity < 1:
self.slots += [None]
self.data += [None]
self.capacity += 1
if self.slots[hashvalue] == None:
self.slots[hashvalue] = key
self.data[hashvalue] = value
self.capacity -= 1
else:
if self.slots[hashvalue] == key:
self.data[hashvalue] = data
else:
rehashed = self.rehash(hashvalue, len(self.slots))
while self.slots[rehashed] != None and self.slots[rehashed] != key:
rehashed = self.rehash(rehashed, len(self.slots))
if self.slots[rehashed] == None:
self.slots[rehashed] = key
self.data[rehashed] = value
self.capacity -= 1
else:
self.data[rehashed] = value
def get(self, key):
startslot = self.hashfunction(key, len(self.slots))
data = None
found = False
stop = False
position = startslot
while self.slots[position] != None and not found and not stop:
if self.slots[position] == key:
data = self.data[key]
found = True
else:
position = self.rehash(position, len(self.slots))
if position == startslot:
stop = True
return data
def __delitem__(self, key):
hashvalue = self.hashfunction(key, len(self.slots))
if self.slots[hashvalue] == key:
self.slots[hashvalue] = None
self.data[hashvalue] = None
else:
rehashed = self.hashfunction(hashvalue, len(self.slots))
while self.slots[rehashed] != key:
rehashed = self.hashfunction(rehashed, len(self.slots))
if self.slots[rehashed] == key:
self.slots[rehashed] == None
self.data[rehashed] == None
def __contains__(self, key):
return key in self.slots
def __getitem__(self, key):
return self.get(key)
def __setitem__(self, key, value):
self.put(key, value)
Upvotes: 3
Views: 2596
Reputation: 1121158
You need to keep your hash and your table size separate. The hash should be based on the key alone, not the size. Store the following information per key-value entry:
You pick a slot based on the table size and the hash:
slot = hash % tablesize
Then, when you run out of space in your current table, produce a new table (say, double the size), to accommodate your growing data set, and reslot everything. You already have the hashes cached, all you have to do is take each (key, hash, value)
tuple and re-calculate the new slot with the above formula, now with a larger tablesize.
You'll also have to decide how to handle collisions; two hashes that given the current table size, end up in the same slot. Python's dict
uses open addressing, where the hash is 'perturbed' in a reproducible fashion until another empty slot is found.
You may want to study the Python dict
source code to see how they do this, see the extensive comments on how collisions are handled. You could also watch this PyCon presentation where Brandon Rhodes explains all those details with some very enlightening graphics. Or you could pick up a copy of Beautiful Code which has a whole chapter on the Python dict
implementation.
Upvotes: 8