Reputation: 2059

Hash table resizing: how do we do it without knowing the key?

When we store only the resultant data for a key in a hashtable, how do I perform hashtable resizing? One example I can think of is to store username & password combinations. For the sake of privacy, we'd only store passwords (there can be many other use cases too). Here, the table just stores the data but not the key. Given this, if I want to copy entries from the old table to the new during resizing, I don't have the key to hash upon.

How is resizing done here?

Upvotes: 2

Answers (4)

Jerry Coffin

Reputation: 490673

In a case like this, you're not really using a hash table at all -- or at least the hash you're talking about isn't being used as the hash for the table.

In other words, the hash (e.g., SHA-256) of the password is just another piece of data, being stored in the table (hash table or otherwise). It's only modified if/when the password changes.

It might be stored in a hash table keyed by, say, the username -- but if so, you're going to have the user name here so you can re-hash when needed.

If, for whatever reason, you decided to use the secure hash of the password as a key into the table, the complete hash would be the key, and what you'd use to index into the table would be some hash of that hash (e.g., two-byte chunks all XORed together).

Edit: As far as table resizing (by itself) goes, no, storing the key is not an absolute necessity. You can get by with just storing the remainder of the original hash code. In other words, you'll typically start by generating, say, a 32-bit hash. You then use part (but only part) of that to index into your table (e.g., 16 bits). When it comes time to resize the table, you will take the position in the table for 16 bits, and the stored remainder of the hash for the other 16 bits to recover the original 32-bit hash. Assuming you're doubling the size of your table, you'll then use 17 bits for the index, and store the remaining 15 bits in the table.

If you didn't mind going a bit insane with the hash itself, you could also use this to eliminate any real need to store the key at all. For example, if you started by creating a 256-bit hash (e.g., SHA-256), you could use N bits as the index into the hash table, and the remaining bits as if it were the key. If your actual keys were longer than 256 bits, this would lead to the possibility of collisions -- but collisions with SHA-256 are sufficiently rare that the odds of encountering one by chance are almost certainly lower than the chances of a computer error during a key comparison, so it said two keys were identical when they really weren't.

Upvotes: 1

cHao

Reputation: 86575

Short version: Hashtables can't work with only the hash codes of their data; they must contain actual data. If all you provide to the hash table is a hash, then it's the data. A hashtable of hashes, which is what you seem to be trying to create, would not have problems with resizing. As far as the hashtable is concerned, the hash you inserted is the data, and whatever was hashed to produce the data is irrelevant.

Long version:

A hashtable and a table of hashes are two orthogonal ideas. It sounds like you're conflating the two in a way that doesn't make much sense in either context.

For a hashtable, the entire point is to map keys unambiguously to values. Without both, the table is useless. The hash code that a hashtable uses is not, and can not be, logically separated from the data -- it comes from the data, and is often calculated on the fly. The sole purpose of the hash code is to quickly locate (or store) the actual key->value mapping; you still need an actual key and an actual value to map.

The big reason for needing the actual key is, the hash code only goes so far. To some degree, every useful hashtable and every hash function is inherently bound by the pigeonhole principle. This means

Every hash function by definition can return equal hash codes for two different values; and
Every useful hashtable has to resolve collisions somehow (between equal hash codes and/or between different hash codes that point at the same bucket).

Hashtables are actually doubly affected by the pigeonhole principle; they not only have to reduce the data to a (typically int-sized) hash code, but since most hash tables can't reasonably have 4 billion buckets, the code then has to do some further reduction with that hash code to turn it into a bucket number. (A common example is taking the hash code times some prime number, mod the number of buckets.)

Even read-only hashtables typically have to resolve collisions. Imagine a read-only hash table that uses a hash function so perfect that every value in the table ends up in its own bucket. Even in this case, what happens when you try to find a key that's not in the table, but produces a hash code that resolves to a bucket containing a key that is in the table? Either you need the value itself to double-check against, or the table lies and says the key exists, even if the hash code is not actually equal!

Basically the table uses the hash code of the lookup key (or, more typically, the return value from some function on that hash code) to figure out what bucket to look in, and then compares each actual key in that bucket with the actual lookup key in order to disambiguate. This wouldn't work without an actual key of some sort, unless:

your hash function was guaranteed to produce a unique value for every input (making it no longer a hash function, but an encoding or encryption function), and
you had an infinite number of buckets.

The only way you could have hashes in a hash table without the original data, is if the hashes are the data. In that case you have a hashtable of hashes. In the case you're mentioning, you might use a hash of the username as a key, mapping to a SHA /mcrypt/bcrypt/whatever hash of the password. In that case, the hash is the key, and no one cares about the username anymore. This would not cause problems with resizing or whatever, as the hash function the hashtable uses has very little to do with the hash function you used; as far as the hashtable cares, the hash you gave it is just another value, and has its own hash code, and that hash code is what the hashtable uses internally to find stuff.

I might warn you against using hashes for usernames, though. I'd suggest that at least one piece of the auth data be collision-proof, and hashes are by their very nature not collision-proof. And i'll personally tear up your programming license if i catch you running around storing unhashed passwords in a potentially user-visible location. :) Perhaps if you encrypted the username instead of hashing it? That'd effectively guarantee its uniqueness, and if you use public-key encryption and "forget" the private key, it's effectively as irreversible as hashing. Nifty side effect is, public-key encryption is generally kinda slow due to how it works. You're basically taking a 1024-bit or larger number and multiplying it by itself thousands of times. So you have some built-in protection against brute force.

Upvotes: 1

amulware

Reputation: 412

This does not answer your question I suppose, but I wonder how a hashtable can be completely ignorant of its entry's keys. Unless of course you know for certain that there will never be collisions. I do not see how that could be possible in the given example of the key being a username(unless the hashtable was ridiculously big to begin with).

Either way, if we are talking about an actual hashtable, and not a direct access table, I do not think it is logically possible to copy entries between tables of different sizes without knowing the key, as usually the size of the hashtable is an important part of the hash function.

Upvotes: 0

melpomene

Reputation: 85897

You do it by storing the original key as well. In the case of passwords, you don't store them in a hashtable.

Upvotes: 0

Hash table resizing: how do we do it without knowing the key?

Answers (4)

Related Questions