Unknown

Reputation: 46791

Does a hash function output need to be bounded less than the number of buckets?

I was reading about this person's interview "at a well-known search company".

http://asserttrue.blogspot.com/2009/05/one-of-toughest-job-interview-questions.html

He was asked a question which led to him implementing a hash table. He said the following:

HASH = INITIAL_VALUE;
FOR EACH ( CHAR IN WORD ) {
HASH *= MAGIC_NUMBER
HASH ^= CHAR
HASH %= BOUNDS
}
RETURN HASH

I explained that the hash table array length should be prime, and the BOUNDS number is less than the table length, but coprime to the table length.

Why should the BOUNDS number be less than the number of buckets? What does being coprime to the table length do? Isn't it supposed to be coprime to the BOUNDS?
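For concreteness, here is a minimal C++ transcription of the quoted pseudocode, keeping the modulo inside the loop exactly as written. The INITIAL_VALUE, MAGIC_NUMBER and BOUNDS constants are placeholder values of my own choosing, not values given in the article.

#include <cstdint>
#include <string>

// Placeholder constants (not from the article); BOUNDS is assumed prime.
const uint32_t INITIAL_VALUE = 5381;
const uint32_t MAGIC_NUMBER  = 33;
const uint32_t BOUNDS        = 1000003;

uint32_t hash_word(const std::string& word) {
    uint32_t hash = INITIAL_VALUE;
    for (char c : word) {
        hash *= MAGIC_NUMBER;
        hash ^= static_cast<unsigned char>(c);
        hash %= BOUNDS;   // note: applied inside the loop, as in the quote
    }
    return hash;
}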

Upvotes: 2

Views: 668

Answers (3)

patros

Reputation: 7819

A simple explicit suffix tree would use at worst maybe 500 KB of memory (with a moderately efficient implementation, 4-byte character encodings, and relatively long English words with minimal overlap) to do the same thing.

I think the guy in the article outsmarted himself.

Upvotes: 0

Peter

Reputation:

I once interviewed for a job at a well-known search company. I got the exact same question. I tried to tackle it by using a hash table.

One thing that I learnt from that interview was that at a well-known search company, you do not propose hashes as solutions. You can use any tree-like structure you like, but you always use an ordered structure, not a hash table.

Upvotes: 0

Tom Leys

Reputation: 19029

I would hazard that he is completely wrong. BOUNDS should be the number of buckets, or the last few buckets will be underused.

Further, the bounding of the output to the number of buckets should be done OUTSIDE the hash function. It is an implementation detail of that particular hash table. You might have a very large table using lots of buckets and another using few; both should share the same string->hash function. A sketch of that separation is below.
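As a rough illustration of that separation (my own sketch, not code from the article): the string->hash function returns a full-width value, and each table reduces it to its own bucket count at lookup time.

#include <cstddef>
#include <cstdint>
#include <string>

// Full-width string hash; knows nothing about any particular table.
// (FNV-1a is used here purely as an example hash.)
uint64_t string_hash(const std::string& word) {
    uint64_t h = 14695981039346656037ULL;   // FNV-1a offset basis
    for (unsigned char c : word) {
        h ^= c;
        h *= 1099511628211ULL;              // FNV-1a prime
    }
    return h;
}

// Each table maps the same hash into its own number of buckets.
std::size_t bucket_for(const std::string& word, std::size_t bucket_count) {
    return string_hash(word) % bucket_count;   // bounding happens here, outside the hash
}

// A small table and a large table can share string_hash:
//   bucket_for(word, 10000);     // small table
//   bucket_for(word, 1000003);   // large table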

Further, the page you linked to is quite interesting if you read it. I would have implemented his hash table with something like 10,000 buckets (for those who haven't read it, the article suggests ~4,000,000,000 buckets to store 1,000,000 or so possible words). For collisions, each bucket holds a vector of word structures, each containing a count, a plaintext string and a hash (unique within the bucket). This would use far less memory and work better with modern caches, since your working set would be much smaller. A sketch of that layout follows.
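A sketch of that bucket layout (field and function names are my own invention; the stored hash lets you reject most non-matches without a string compare):

#include <cstdint>
#include <string>
#include <vector>

uint64_t string_hash(const std::string& word);   // as in the previous sketch

// One entry per distinct word that landed in this bucket.
struct WordEntry {
    uint64_t    hash;    // full hash, effectively unique within the bucket
    std::string text;    // plaintext word, for exact comparison on collision
    uint32_t    count;   // occurrences seen so far
};

// Each bucket is a small vector of entries; ~10,000 buckets total.
using Bucket = std::vector<WordEntry>;

void add_word(std::vector<Bucket>& table, const std::string& word) {
    uint64_t h = string_hash(word);
    Bucket& bucket = table[h % table.size()];   // bounding done by the table
    for (WordEntry& e : bucket) {
        if (e.hash == h && e.text == word) {    // cheap hash check before string compare
            ++e.count;
            return;
        }
    }
    bucket.push_back({h, word, 1});
}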

To further reduce memory usage, you could experiment with culling words from the hash during the input phase if their current count suggests they will fall below the top 100,000.
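One way to sketch that culling idea (my own illustration, reusing the Bucket and WordEntry types above; the threshold is an arbitrary heuristic, and words dropped here could in principle still have belonged in the top 100,000, which is why it is only something to experiment with):

#include <algorithm>
#include <vector>

// Drop entries whose current count is below min_count from every bucket.
// Call this occasionally when the table grows past a chosen size limit.
void cull(std::vector<Bucket>& table, uint32_t min_count) {
    for (Bucket& bucket : table) {
        bucket.erase(
            std::remove_if(bucket.begin(), bucket.end(),
                           [&](const WordEntry& e) { return e.count < min_count; }),
            bucket.end());
    }
}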

Upvotes: 4
