Reputation: 420
A typical hash table lookup algorithm - including at least one that claims to be the fastest in the world - is structured something like this:
while (true) {
    if (currentSlot.isEmpty) return null;                   // empty slot: key is not present
    if (currentSlot.key == key) return currentSlot.value;   // found the key
    currentSlot = GetNextSlot();                            // keep probing
}
The important point is that it checks each slot in turn, stops if it finds the right key or reaches an empty slot, and otherwise continues searching. This is pseudocode illustrating the pattern, not a real implementation.
This looks like it should be a branch prediction nightmare. When the table is very full or very empty, prediction should be quite reliable, but under normal use I would expect the branching during the search to be quite random, as it depends on the stored data.
I expected to find that high-performance hash tables would optimise with tricks like checking batches of four keys at once between branches to reduce mispredictions (something like the sketch below), but this doesn't seem to be true.
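For illustration, this is roughly the kind of batching I had in mind - a hypothetical sketch for an open-addressing table of int keys, not taken from any real implementation (hash(), EMPTY, and the group size of 4 are all made up for the example):

#include <cstdint>

constexpr int EMPTY = INT32_MIN;        // assumed sentinel for an unused slot

// Assumed hash function for the example (Knuth multiplicative hash).
static uint32_t hash(int key) { return (uint32_t)key * 2654435761u; }

// Check a batch of 4 slots between branches. Assumes linear probing, a
// power-of-two capacity (so group-aligned groups never run off the end),
// and a table that is never completely full.
int lookup(const int *keys, const int *values, int capacity, int key) {
    int i = (int)(hash(key) & (uint32_t)(capacity - 1)) & ~3;  // group-aligned start
    for (;;) {
        int hit = -1, stop = 0;
        for (int j = 0; j < 4; ++j) {               // branch-free inner loop
            if (keys[i + j] == key) hit = i + j;    // compilers can turn this into cmov
            stop |= (keys[i + j] == EMPTY);
        }
        if (hit >= 0) return values[hit];           // one branch per 4 slots
        if (stop) return -1;                        // empty slot: key not present
        i = (i + 4) & (capacity - 1);               // next group, wrapping
    }
}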
Are branch mispredictions a significant cost in hash table lookups? If they are, why don't implementations typically try to avoid them? If they're not, why not?
Upvotes: 2
Views: 509
Reputation: 120848
Yes, it might hurt if there are multiple keys in a slot; that is why the performance of a hash table is amortized O(1).
Usually, the thing you are looking for will be in the very first slot, and there are code patterns for that; for example, Java's HashMap has this piece of code:
final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        if (first.hash == hash && // always check first node
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        // ...
Notice the always check first node comment. This is a very common pattern (most lookups succeed on the first node), so they treat it specially.
Upvotes: 0
Reputation: 50338
Are branch mispredictions a significant cost in hash table lookups?
This is highly dependent on the test case. Indeed, when the hash table is too big to fit in CPU caches, the latency of main memory (usually 60-120 ns) is much larger than the cost of a branch misprediction (usually 10-15 cycles, i.e. only a few nanoseconds on a multi-GHz CPU). The same thing applies to the L3 cache, although the effect is less visible.
If they are, why don't implementations typically try to avoid them?
The main reason is that this is hard, and I guess not always possible (especially when the key and the value are non-POD objects).
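To illustrate why (my sketch, not the answer's; the function names are made up): with a POD key the compare is a single instruction that can be accumulated into a flag without branching, while a non-POD key's operator== is an opaque call that loops and branches internally, leaving nothing for the table author to flatten:

#include <string>

// POD keys: a group of compares can be folded into one flag, branch-free.
bool any_match_pod(const int *slots, int key) {
    bool m = false;
    for (int j = 0; j < 4; ++j)
        m |= (slots[j] == key);   // single-instruction compare, no branch needed
    return m;
}

// Non-POD keys: each == is a call into std::string's comparison, which
// branches on lengths and contents internally; it cannot be flattened here.
bool any_match_obj(const std::string *slots, const std::string &key) {
    bool m = false;
    for (int j = 0; j < 4; ++j)
        m |= (slots[j] == key);
    return m;
}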
Upvotes: 2
Reputation: 364180
My understanding is that for good performance, you want to make your hash tables large enough that chaining to another bucket is rare. The rarer it is, the less it matters that the chaining branch will mispredict when chaining does happen (branch prediction learns from the common case, which avoids the path that leads to chaining to the next bucket).
A good hash table will choose its chaining algorithm to distribute well, avoiding big buildups of long chains until the hash table is very nearly full. (That's if we're talking about "stealing" another bucket in the main table, rather than building a linked list for this entry. Linked lists are relatively slow to traverse / linear search because of the data dependency for following the next pointer.)
Also note that a cache miss (or even an L3 cache hit) may cost more than a branch miss on modern CPUs with fast recovery, which don't have to wait for the out-of-order back-end to drain before starting recovery. (See What exactly happens when a skylake CPU mispredicts a branch?)
Chaining to the adjacent bucket might not be optimal; I forget what the state of the art is. IIRC it's not, because one chain creates a dense group of used-up buckets, so any other entries that hash to those buckets would also have to chain, making the problem worse. You don't want to do more scattered memory fetches than you need.
Otherwise, if you were just chaining to adjacent entries in the array you're using as a hash table, then yes, branchlessly fetching some adjacent keys (from the same or the next cache line) could be fine. If the entries are small enough, checking them in parallel with SIMD could possibly even be worth it (see the sketch below). At least for integer keys or something else that can be compared efficiently; it wouldn't be worth doing 4 string-compares in parallel.
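A minimal sketch of that idea, assuming 32-bit int keys in an open-addressing array (the group size of 4 and the choice of SSE2 intrinsics are mine, not from any particular hash table):

#include <emmintrin.h>   // SSE2 intrinsics

// Compare 4 adjacent int keys in one step and return a bitmask of matches,
// so the probe loop can branch once per group of 4 instead of once per slot.
// Assumes keys[] is padded so reading 4 entries from index i stays in bounds.
int match_mask(const int *keys, int i, int key) {
    __m128i group  = _mm_loadu_si128((const __m128i *)&keys[i]);
    __m128i target = _mm_set1_epi32(key);             // broadcast the key
    __m128i eq     = _mm_cmpeq_epi32(group, target);  // lanewise ==
    return _mm_movemask_ps(_mm_castsi128_ps(eq));     // 1 bit per matching lane
}

A caller would branch once on the mask (e.g. mask != 0, then count trailing zeros to find the matching slot); as noted above, this only pays off for keys that compare in a single instruction.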
So yes, cache misses are bad. But if most of your lookups find their hit in the first bucket, then that's an easy pattern for branch prediction. In many use-cases for hash tables, the majority of lookups will be for existing entries that get used repeatedly.
More sophisticated branch predictors (like IT-TAGE in Intel Haswell and later) use recent branch history to form an index into the table of branch-prediction entries, so different branching leading up to different kinds of lookups can use different predictions for the same branch in the same hash-table code. So e.g. one function that typically looks up entries that are not present might have its hash lookups predict correctly (if they don't chain), while another function that looks up known-present keys can also predict correctly.
Still, in the case where your hash table is starting to fill up and you have a non-negligible amount of chaining, branch mispredicts would be something to worry about. (And measure with hardware performance counters, e.g. perf stat, or perf record -e branch-misses ./my_hashtable_benchmark followed by perf report, on GNU/Linux.)
Upvotes: 1