Has QUIT--Anony-Mousse

Reputation: 77454

4-ary heaps in Java

Binary heaps are commonly used in e.g. priority queues. The basic idea is that of an incomplete heap sort: you keep the data sorted "just enough" to get the top element out quickly.

While 4-ary heaps are theoretically worse than binary heaps, they also have some benefits. For example, they require fewer heap restructuring operations (as the heap is much shallower), while obviously needing more comparisons at each level. But (and that is probably their main benefit?) they may have better CPU cache locality. So some sources say that 3-ary and 4-ary heaps outperform both Fibonacci and binary heaps in practice. They should not be much harder to implement; the additional cases are just a few extra if statements.

Has anyone experimented with 4-ary (and 3-ary) heaps for priority queues and done some benchmarking? In Java, you never know whether they are faster or slower until you have benchmarked them extensively. And from all I've found via Google, it may be quite language- and use-case-dependent. Some sources say they found 3-ary heaps to perform best for them.
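For illustration, here is a minimal sketch of the sift-down step for a 4-ary min-heap over a primitive double[] (hypothetical code, 0-indexed array; not taken from any of the implementations discussed below):

    // Repair the heap property at slot `pos` of a 4-ary min-heap.
    // `heap` holds `size` valid entries; children of i sit at 4i+1 .. 4i+4.
    static void siftDown(double[] heap, int size, int pos) {
        double cur = heap[pos];
        while (true) {
            int child = 4 * pos + 1;           // first of up to four children
            if (child >= size) break;          // pos is a leaf
            int best = child;                  // find the smallest child
            int end = Math.min(child + 4, size);
            for (int c = child + 1; c < end; c++) {
                if (heap[c] < heap[best]) best = c;
            }
            if (cur <= heap[best]) break;      // heap property holds
            heap[pos] = heap[best];            // pull the smaller child up
            pos = best;
        }
        heap[pos] = cur;
    }

The tree has only half as many levels as a binary heap (log4 n vs. log2 n), but each level needs up to three comparisons just to pick the smallest child.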

Some more points:

Upvotes: 6

Views: 2701

Answers (3)

Erich Schubert

Reputation: 8705

I have not benchmarked 4-ary heaps yet. I'm currently trying to optimize our own heap implementations, and I'm trying 4-ary heaps there, too. And you are right: we will need to benchmark this carefully, as it is easy to get misled by implementation differences, and HotSpot optimization will heavily affect the results. Plus, small heaps will probably show different performance characteristics than large heaps.

The Java PriorityQueue is a very simple heap implementation, but that means HotSpot will optimize it well. It's not bad at all: most people would implement a worse heap. But, for example, it indeed does not do efficient bulk loads or bulk adds (bulk repairs). However, in my experiments it was hard to consistently beat this implementation, even in simulations with repeated inserts, unless you go for really large heaps. Furthermore, in many situations it pays off to replace the top element in the heap instead of poll() + add(); this is not supported by Java's PriorityQueue.
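To illustrate the replace-top idea (a hypothetical sketch, not ELKI's actual code): poll() followed by add() repairs the heap twice, while a combined operation overwrites the root and sifts down once.

    // Hypothetical replaceTopElement for an array-backed min-heap:
    // one sift-down instead of the two repair passes of poll() + add().
    double replaceTopElement(double e) {
        double prev = heap[0];    // old minimum
        heap[0] = e;              // overwrite the root in place
        siftDown(heap, size, 0);  // single repair pass (see sketch above)
        return prev;
    }

This is particularly handy for bounded top-k style queues, where nearly every insertion evicts the current top element.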

Some of the performance gains in ELKI (and I've seen you are an ELKI user) across versions are actually due to improved heap implementations. But it goes back and forth; it's hard to predict which heap variant performs best across real workloads. The key benefit of our implementation is probably the "replaceTopElement" function. You can inspect the code here:

SVN de.lmu.ifi.dbs.elki.utilities.heap package

You will notice we have a whole set of heaps there. They are optimized for different stuff, and will need some more refactoring. A number of these classes are actually generated from templates, similar to what GNU Trove does. The reason is that Java can be quite costly when managing boxed primitives, so it does pay off to have primitive versions. (yes, there are plans to split this out into a separate library. It's just not of high priority.)

Note that ELKI deliberately does not endorse the Java Collections API. We have found the java.util.Iterator interface in particular to be quite costly, and thus try to encourage people to use C++-style iterators throughout ELKI:

for (Iter iter = ids.iter(); iter.valid(); iter.advance()) {
    // access the current element via iter's (primitive) getters
}

often saves a lot of unnecessary object creation compared to the java.util.Iterator API. Plus, these iterators can have multiple (and primitive) value getters, whereas Iterator.next() is a mixture of a getter and the advance operator.
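A sketch of what such an iterator contract might look like (hypothetical interface; ELKI's actual API may differ):

    // C++-style iterator: validity test and advance are separate from the
    // getters, and getters can return primitives, so a loop causes neither
    // per-element Iterator allocations nor Double boxing.
    public interface DoubleIter {
        boolean valid();     // still positioned on a valid element?
        void advance();      // move to the next element
        double getDouble();  // primitive getter, no boxing
    }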

Ok, I have drifted off too much now, back to the topic of 4-ary heaps:

If you intend to try out 4-ary heaps, I suggest you start with the ObjectHeap class there.

Update: I've been microbenchmarking, but the results so far are inconclusive. It's hard to beat PriorityQueue consistently. In particular, bulk loading and bulk repairs do not seem to gain anything in my benchmark - probably they cause HotSpot to optimize less, or to de-optimize at some point. As so often, simpler Java code is faster than complex logic. So far, 4-ary heaps without bulk loading seem to work best. I haven't tried 5-ary yet. 3-ary heaps are about equal with 4-ary heaps, and the memory layout of 4-ary is a bit nicer. I'm also considering trying a heap-of-heaps approach to save on array resizing. But I expect that the increased code complexity means it will run slower in practice.
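For context, "bulk loading" here presumably refers to bottom-up construction in the style of Floyd's heapify (a sketch under that reading, reusing the 4-ary siftDown from the question above):

    // Floyd-style bulk repair: sift down every internal node, deepest first.
    // O(n) total work, versus O(n log n) for n individual sift-up inserts.
    static void heapify(double[] heap, int size) {
        if (size < 2) return;  // nothing to repair
        // Parent of the last element (index size-1) in a 0-indexed
        // 4-ary heap is (size - 2) / 4; everything after it is a leaf.
        for (int i = (size - 2) / 4; i >= 0; i--) {
            siftDown(heap, size, i);
        }
    }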

Upvotes: 1

Has QUIT--Anony-Mousse

Reputation: 77454

As per @ErichSchubert's suggestion, I have taken the implementations from ELKI and modified them into a 4-ary heap. It was a bit tricky to get the indexing right, as a lot of the publications around 4-ary heaps use formulas for 1-indexed arrays?!?
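For the record, this is the translation that needs care (hypothetical snippet; i is a node index):

    // 0-indexed array (Java): children of i at 4*i + 1 .. 4*i + 4
    int firstChild = 4 * i + 1;
    int parent = (i - 1) / 4;  // equivalently (i - 1) >>> 2
    // 1-indexed (common in publications): children at 4*i - 2 .. 4*i + 1,
    // parent at (i + 2) / 4 - off by just enough to bite if mixed up.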

Here are some early benchmark results, based on the ELKI unit test. 200000 Double objects are preallocated (to avoid measuring memory management too much) and shuffled.

As a warmup, 10 iterations are performed for each heap; for benchmarking, 100 iterations, though I'll probably try to scale this up further. 10-30 seconds isn't all that reliable for benchmarking yet, and OTOH I should try to measure standard deviations, too. In each iteration, the 200000 elements are added to the heap, then half of them are polled again. Yes, the workload could also be made more complex.
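Roughly, one measured iteration looks like this (hypothetical harness sketch, not the actual ELKI unit test; DoubleMinHeap stands in for the heap under test):

    // One benchmark iteration: add all elements, then poll half of them.
    // `data` is the pre-shuffled array of 200000 values.
    DoubleMinHeap heap = new DoubleMinHeap();
    for (double d : data) {
        heap.add(d);
    }
    for (int i = 0; i < data.length / 2; i++) {
        heap.poll();
    }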

Here are the results (times in seconds):

  • My 4-ary DoubleMinHeap: 10.371
  • ELKI DoubleMinHeap: 12.356
  • ELKI Heap<Double>: 37.458
  • Java PriorityQueue<Double>: 45.875

So the difference between the 4-ary heap (probably not yet L1 cache-aligned!) and the ELKI heap for primitive doubles is not too big. Well, 10%-20% or so; it could be worse.

The difference between a heap for primitive doubles and a heap for Double objects is much larger. And the ELKI Heap is indeed quite clearly faster than the Java PriorityQueue (but that one seems to have a high variance). There was a slight "bug" in ELKI, though - at least the primitive heaps did not use the bulk loading code yet. It's there, it's just not being used, as every element insertion repairs the heap immediately instead of delaying this until the next poll(). I fixed this for my experiments, essentially by removing a few lines and adding one ensureValid() call. Furthermore, I don't have a 4-ary object heap yet, and I haven't included ELKI's DoubleObjectMinHeap yet... quite a lot to benchmark, and I'll probably give Caliper a try for that.

Upvotes: 2

mikera

Reputation: 106351

I haven't benchmarked it myself, but I have a few relevant points to make.

Firstly, note that the standard Java implementation of PriorityQueue uses a binary heap.

It is plausibly the case that, despite the cache locality benefit of n-ary heaps, binary heaps are still the best solution on average. Below are some slightly hand-wavy reasons why this might be the case:

  • For most interesting objects, comparison costs are probably much more significant than cache locality effects in the heap data structure itself. n-ary heaps require more comparisons, and that alone is probably enough to outweigh any cache locality benefit in the heap.
  • If you were simply making a heap of numbers in place (i.e. backed by an array of ints or doubles), then I can see that the cache locality would be a worthwhile benefit. But this isn't the usual case: usually you will have a heap of object references. Cache locality on the object references themselves is then less useful, since each comparison will require following at least one extra reference to examine the referenced object and its fields.
  • The common case for priority heaps is probably quite a small heap. If you are hitting it sufficiently that you care about it from a performance perspective, it's probably all in the L1 cache anyway. So no cache locality benefit for the n-ary heap anyway.
  • It's easier to handle a binary heap with bitwise ops (see the sketch after this list). Sure, it's not a big advantage, but every little helps....
  • Simpler algorithms are generally faster than more complex ones, all else being equal, simply because of a lower constant overhead. You get benefits like lower instruction cache usage, higher likelihood of the compiler being able to find smart optimisations etc. Again this works in favour of the binary heap.
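To illustrate the bitwise-ops point (a minimal sketch; 0-indexed backing array, as java.util.PriorityQueue uses):

    // Binary heap index arithmetic reduces to shifts:
    int parent = (i - 1) >>> 1;  // parent of node i
    int left = (i << 1) + 1;     // first child of node i
    // 4-ary heaps still admit shifts, since 4 is a power of two
    // ((i - 1) >>> 2 and (i << 2) + 1), but 3-ary heaps need real
    // division and multiplication.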

Obviously, of course, you'd need to do your own benchmarks on your own data before you come to a real conclusion about which performs best (and whether the difference is enough to care about, which I personally doubt....)

EDIT

Also, I did write a priority heap implementation using an array of primitive keys, which may be of interest given that the original poster mentioned primitive keys in the comment below.

This could probably be hacked into an n-ary version for benchmarking purposes relatively easily if anyone was interested in running a test.

Upvotes: 1
