Reputation: 1395
I am looking for a hash table data structure that does not require rehashing for expansion and shrinking.
Rehashing is a CPU-consuming effort. I was wondering if it is possible to design a hash table data structure in a way that does not require rehashing at all. Have you heard of such a data structure before?
Upvotes: 3
Views: 3220
Reputation: 106196
That depends on what you call "rehash":
If you simply mean that the table-level rehash shouldn't reapply the hash function to each key during resizing, then that's easy with most libraries: e.g. wrap the key and its raw (pre-modulo-table-size) hash value together a la struct X { size_t hash_; Key key_; };, supply the hashtable library with a hash function that returns hash_, but a comparison function that compares key_s (depending on the complexity of key_ comparison, you may be able to use hash_ to optimise, e.g. lhs.hash_ == rhs.hash_ && lhs.key_ == rhs.key_). For keys that are trivially cheap to hash anyway (e.g. ints), this will just slow you down and waste memory.
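As a minimal sketch of that wrapping in C++ with std::unordered_set (the CachedKey and CachedHash names are purely illustrative):

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Illustrative wrapper: cache the raw hash next to the key so it is
// computed exactly once, at construction.
struct CachedKey {
    std::size_t hash_;   // raw, pre-modulo hash
    std::string key_;
    explicit CachedKey(std::string k)
        : hash_(std::hash<std::string>{}(k)), key_(std::move(k)) {}
    bool operator==(const CachedKey& rhs) const {
        // Cheap hash comparison first; full key comparison only on a match.
        return hash_ == rhs.hash_ && key_ == rhs.key_;
    }
};

// Hash functor handed to the table: just returns the cached value, so the
// hash function is never reapplied to the key, even during a resize.
struct CachedHash {
    std::size_t operator()(const CachedKey& x) const { return x.hash_; }
};

int main() {
    std::unordered_set<CachedKey, CachedHash> table;
    table.insert(CachedKey("hello"));
    table.insert(CachedKey("world"));
    return table.count(CachedKey("hello")) ? 0 : 1;
}

With this arrangement a table-level resize still redistributes entries across buckets, but it only reads hash_ rather than reapplying std::hash to each key.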
If you mean the table-level operation of increasing or decreasing memory storage and reindexing all stored values, then yes - it can be avoided - but to do so you have to fundamentally change the way the hash table works, and the normal performance profile. Discussed below.
As just one example, you could leverage a more typical hashtable implementation (let's call it H) by having your custom hashtable (C) have an H** p that - up to an initial size limit - will have p[0] be the only instance of H, and simply ferry operations/results through. If the table grows beyond that, you keep p[0] referencing the existing H, while creating a second H hashtable to be tracked by p[1]. Then things start getting dicey:
- to search or erase in C, your implementation needs to search p[1] then p[0] and report any match from either
- to insert a new value in C, your implementation must confirm it's not in p[0], then insert into p[1]
- during insert (and potentially even for other operations), it could optionally migrate any matching entry - or an arbitrary p[0] entry - to p[1], so gradually p[0] empties; you can easily guarantee p[0] will be empty before p[1] becomes so full that a larger table is needed. When p[0] is empty, you may want to do p[0] = p[1]; p[1] = NULL; to keep the simple mental model of what's where - lots of options (a sketch of this scheme follows below).

Some existing hash table implementations are very efficient at iterating over elements (e.g. GNU C++ std::unordered_set), as there's a singly linked list of all the values, and the hash table is really only a collection of pointers (in C++ parlance, iterators) into the linked list. This can mean that if your utilisation falls below some threshold (e.g. a 10% load factor) for your only/larger hash table, you know you can very efficiently migrate the remaining elements to a smaller table.
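Returning to the two-table idea, here is a minimal C++ sketch with std::unordered_set<int> playing the role of H (the TwoTable name, the migrate-one-entry-per-insert policy, and the doubling of the size limit are illustrative choices, not a prescribed design):

#include <cstddef>
#include <unordered_set>

class TwoTable {
    std::unordered_set<int> tables[2];  // [0] = old/small H, [1] = new/large H
    bool migrating = false;             // is tables[1] in use?
    std::size_t limit;                  // size at which migration starts

public:
    explicit TwoTable(std::size_t initialLimit) : limit(initialLimit) {}

    bool contains(int key) const {
        // Search the newer table first, then the older one.
        return (migrating && tables[1].count(key)) || tables[0].count(key);
    }

    void insert(int key) {
        if (!migrating && tables[0].size() < limit) {
            tables[0].insert(key);
            return;
        }
        migrating = true;
        tables[0].erase(key);       // confirm it's not in the old table
        tables[1].insert(key);
        // Opportunistically migrate one arbitrary old entry, so the old
        // table drains before the new one fills.
        if (!tables[0].empty()) {
            auto it = tables[0].begin();
            tables[1].insert(*it);
            tables[0].erase(it);
        }
        if (tables[0].empty()) {    // migration finished: swap roles
            tables[0] = std::move(tables[1]);
            tables[1].clear();
            migrating = false;
            limit *= 2;
        }
    }
};

Note that std::unordered_set will still rehash internally as it grows; a real implementation of this scheme would create each H pre-sized (e.g. via reserve) so that never happens.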
These kinds of tricks are used by some hash tables to avoid a sudden heavy cost during rehashing, and instead spread the pain more evenly over a number of subsequent operations, avoiding a possibly nasty spike in latency.
Some of the implementation options only make sense for either an open or a closed hashing implementation, or are only useful when the keys and/or values are small or large, or depending on whether the table embeds them or points to them. The best way to learn about it is to code...
Upvotes: 2
Reputation: 2902
Assuming you actually do need this: it is possible. Here I'll give a trivial example you can build on.
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Basic types we deal with
typedef uint32_t key_t;
typedef void * value_t;

typedef struct
{
    key_t key;
    value_t value;
} hash_table_entry_t;

typedef struct
{
    uint32_t initialSize;
    uint32_t size;  // current max entries
    uint32_t count; // current filled entries
    hash_table_entry_t *entries;
} hash_table_t;

// Hash function depends on the size of the table
key_t hash(value_t value, uint32_t size)
{
    // Simple hash: reduce the pointer's bits modulo the table size
    return (key_t)(uintptr_t)value % size;
}

void init(hash_table_t *pTable, uint32_t initialSize)
{
    pTable->initialSize = initialSize;
    pTable->size = initialSize;
    pTable->count = 0;
    pTable->entries = malloc(pTable->size * sizeof(*pTable->entries));
    /// @todo handle null return
    // Set all bytes to 0xFF so every key reads as ~0, signalling an empty slot.
    memset(pTable->entries, ~0, pTable->size * sizeof(*pTable->entries));
}

void insert(hash_table_t *pTable, value_t val)
{
    key_t key = hash(val, pTable->size);
    // Linear probing: scan at most one full pass over the table.
    for (uint32_t n = 0; n < pTable->size; n++)
    {
        key_t i = (key + n) % pTable->size;
        if (pTable->entries[i].key == (key_t)~0)
        {
            pTable->entries[i].key = key;
            pTable->entries[i].value = val;
            pTable->count++;
            break;
        }
    }
    // Expand when 50% full
    if (pTable->count > pTable->size/2)
    {
        pTable->size *= 2;
        pTable->entries = realloc(pTable->entries, pTable->size * sizeof(*pTable->entries));
        /// @todo handle null return
        // Mark only the newly added top half as empty; existing entries stay put.
        memset(pTable->entries + pTable->size/2, ~0, pTable->size/2 * sizeof(*pTable->entries));
    }
}

bool contains(hash_table_t *pTable, value_t val)
{
    // Try current size first
    uint32_t sizeToTry = pTable->size;
    do
    {
        key_t key = hash(val, sizeToTry);
        // Probe modulo sizeToTry to replay the probe sequence that was
        // used when the table had that size.
        for (uint32_t n = 0; n < sizeToTry; n++)
        {
            key_t i = (key + n) % sizeToTry;
            if (pTable->entries[i].key == (key_t)~0)
                break;
            if (pTable->entries[i].key == key && pTable->entries[i].value == val)
                return true;
        }
        // Try all previous sizes we had. Only report failure if found for none.
        sizeToTry /= 2;
    } while (sizeToTry >= pTable->initialSize);
    return false;
}
The idea is that the hash function depends on the size of the table. When you change the size of the table, you don't rehash current entries. You add new ones with the new hash function. When reading the entries, you try all the hash functions that have ever been used on this table.
This way, get()/contains() and similar operations take longer the more times you have expanded your table, but you don't have the huge spike of rehashing. I can imagine some systems where this would be a requirement.
Upvotes: 0
Reputation: 144949
It depends on what you want to avoid. Rehashing implies recomputing the hash values. You can avoid that by storing the hash values in the hash structures. Redispatching the entries into the reallocated hashtable may be less expensive (typically a single modulo or masking operation per entry) and is hardly avoidable for simple hashtable implementations.
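As a small sketch of that redispatch (assuming separate chaining, power-of-two table sizes, and an Entry type that carries its stored hash; all names here are made up for illustration):

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative bucket entry that stores the full hash next to the payload.
struct Entry {
    std::size_t hash;    // full hash, computed once at insertion
    std::uint64_t key;
};

// Redispatch entries into a table whose size is a power of two:
// the hash function is never re-applied, only a mask per entry.
std::vector<std::vector<Entry>> resize(const std::vector<std::vector<Entry>>& old,
                                       std::size_t newSize) {
    std::vector<std::vector<Entry>> table(newSize);
    for (const auto& bucket : old)
        for (const Entry& e : bucket)
            table[e.hash & (newSize - 1)].push_back(e);  // mask, not rehash
    return table;
}

Here the resize still touches every entry, but it performs no hashing at all - just one masking operation each.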
Upvotes: 1