pirox22

Reputation: 922

Fast hash function for long string key

I am using an extendible hash table and I want to use strings as keys. The problem is that my current hash function iterates over the whole string/key, and I suspect this is bad for the program's performance, since the hash function is called many times, especially when I am splitting buckets.

Current hash function

// Polynomial rolling hash: visits every character of the key.
unsigned long hash(const string& key)
{
    const unsigned long seed = 131;
    unsigned long hash = 0;
    for (unsigned i = 0; i < key.length(); i++)
    {
        hash = (hash * seed) + key[i];
    }
    return hash;
}

The keys could be as long as 40 characters.

Example of string/key

string key = "from-to condition";

I have searched over the internet for a better one but I didn't find anything to match my case. Any suggestions?

Upvotes: 1

Views: 4841

Answers (3)

Richard Hodges

Reputation: 69942

I am using an extendible hash and I want to have strings as keys.

As mentioned before, use std::hash until there is a good reason not to.

The problem is that the current hash function that I am using iterates over the whole string/key and I think that this is pretty bad...

It's an understandable thought, but is actually unlikely to be a real concern.

(anticipating) why?

A quick scan over stack overflow will reveal many experienced developers talking about caches and cache lines.

(forgive me if I'm teaching my grandmother to suck eggs)

A modern CPU is incredibly quick at processing instructions and performing (very complex) arithmetic. In almost all cases, what limits its performance is having to talk to memory across a bus, which is by comparison, horribly slow.

So chip designers build in memory caches - extremely fast memory that sits in the CPU (and therefore does not have to be communicated with over a slow bus). Unhappily, there is only so much space available [plus heat constraints - a topic for another day] for this cache memory so the CPU must treat it like an OS does a disk cache, flushing memory and reading in memory as and when it needs to.

As mentioned, communicating across the bus is slow - (simply put) it requires all the electronic components on the motherboard to stop and synchronise with each other. This wastes a horrible amount of time [This would be a fantastic point to go into a discussion about the propagation of electronic signals across the motherboard being constrained by approximately half the speed of light - it's fascinating but there's only so much space here and I have only so much time]. So rather than transfer one byte, word or longword at a time, the memory is accessed in chunks - called cache lines.

It turns out that this is a good decision by chip designers because they understand that most memory is accessed sequentially - because most programs spend most of their time accessing memory linearly (such as when calculating a hash, comparing strings or objects, transforming sequences, copying and initialising sequences and so on).

What's the upshot of all this?

Well, bizarrely, if your string is not already in-cache, it turns out that reading one byte of it is almost exactly as expensive as reading all of the first (say) 128 bytes of it.

Plus, because the cache circuitry assumes that memory access is linear, it will begin a fetch of the next cache line as soon as it has fetched your first one. It will do this while your CPU is performing its hash computation.

I hope you can see that in this case, even if your string were many thousands of bytes long and you chose to hash only (say) every 128th byte, all you would achieve would be a very much inferior hash, while still causing the memory cache to halt the processor while it fetched large chunks of unused memory. It would take just as long - for a worse result!

Having said that, what are good reasons not to use the standard implementation?

Only when:

  1. The users are complaining that your software is too slow to be useful, and

  2. The program is verifiably CPU-bound (using 100% of CPU time), and

  3. The program is not wasting any cycles by spinning, and

  4. Careful profiling has revealed that the program's biggest bottleneck is the hash function, and

  5. Independent analysis by another experienced developer confirms that there is no way to improve the algorithm (for example by calling hash less often).

In short, almost never.

Upvotes: 3

eldruin

Reputation: 342

You can directly use std::hash instead of implementing your own function.

#include <iostream>
#include <functional>
#include <string>

size_t hash(const std::string& key)
{
    std::hash<std::string> hasher;
    return hasher(key);
}

int main() {
    std::cout << hash("abc") << std::endl;
    return 0;
}

See this code here: https://ideone.com/4U89aU

Upvotes: 1

You should prefer to use std::hash unless measurement shows that you can do better. To limit the number of characters it uses, use something like:

    const auto limit = std::min<std::size_t>(key.length(), 16);
    for (std::size_t i = 0; i < limit; i++)

You will want to experiment to find the best value to use in place of 16.

I would actually expect the performance to get worse (because you will have more collisions). If your strings were several kilobytes long, then limiting to the first 64 bytes might well be worthwhile.

Depending on your strings, it might be worth starting not at the beginning. For example, hashing filenames you would probably do better using the characters between 20 and 5 from the end (ignore the often constant pathname prefix, and the file extension). But you still have to measure.

Upvotes: 4
