Connor Doyle
Connor Doyle

Reputation: 1912

Performance of immutable set implementations in Scala

I have recently been diving into Scala and (perhaps predictably) have spent quite a bit of time studying the immutable collections API in the Scala standard library.

I am writing an application that necessarily does many +/- operations on large sets. For this reason, I want to ensure that the implementation I choose is a so-called "persistent" data structure so that I avoid doing copy-on-write. I saw this answer by Martin Odersky, but it didn't really clear up the issue for me.

I wrote the following test code to compare the performance of ListSet and HashSet for add operations:

import scala.collection.immutable._

object TestListSet extends App {
  var set = new ListSet[Int]
  for(i <- 0 to 100000) {
    set += i
  }
}

object TestHashSet extends App {
  var set = new HashSet[Int]
  for(i <- 0 to 100000) {
    set += i
  }
}

Here is a rough runtime measurement of the HashSet:

$ time scala TestHashSet

real    0m0.955s
user    0m1.192s
sys     0m0.147s

And ListSet:

$ time scala TestListSet

real    0m30.516s
user    0m30.612s
sys     0m0.168s

Cons on a singly linked list is a constant-time operation, but this performance looks linear or worse. Is this performance hit related to the need to check each element of the set for object equality to conform to the no-duplicates invariant of Set? If this is the case, I realize it's not related to "persistence".

As for official documentation, all I could find was the following page, but it seems incomplete: Scala 2.8 Collections API -- Performance Characteristics. Since ListSet seems initially to be a good choice for its memory footprint, maybe there should be some information about its performance in the API docs.

Upvotes: 12

Views: 5689

Answers (3)

Nicholas
Nicholas

Reputation: 2205

An old question but also a good example of conclusions being drawn on the wrong foundation.

Connor, basically you're trying to do a microbenchmark. That is generally not recommended and damn hard to do properly.

Why? Because the JVM is doing many other things than executing the code in your examples. It's loading classes, doing garbage collection, compiling bytecode to native code, etc. All dynamically and based on different metrics sampled at runtime.

So you cannot conclude anything about the performance of the two collections with the above test code. For example, what you could actually be measuring could be the compilation time of the += method of HashSet and garbage collection times of ListSet. So it's a comparison between apples and pears.

To do a micro benchmark properly, you should:

  1. Warm up the JVM: Load all classes, ensure all code paths in the benchmark are run and hot spots in the code are compiled (e.g. the += method).
  2. Run the benchmark and ensure neither the GC or the compiler runs during the test (use the JVM flags -XX:-PrintCompilation and -XX:-PrintGC). If either runs during the test, discard the result.
  3. Repeat step 2 and sample 10-15 good measurements. Calculate variance and standard deviation.
  4. Evaluate: If the mean of each benchmark +/- 3 std do not overlap, then you can draw a conclusion about which is faster. Otherwise, it's a blurry result (depending on the amount of overlap).

I can recommend reading Oracle's recommendations for doing micro benchmarks and a great article about benchmark pitfalls by Brian Goetz.

Also, if you want to use a good tool, which does all the above for you, try Caliper by Google.

Upvotes: 10

Dylan
Dylan

Reputation: 13924

Since a set has to have with no duplicates, before adding an element, a Set must check to see if it already contains the element. This search in a list that has no guarantee of an element's position will be O(N) linear time. The same general idea applies to its remove operation.

With a HashSet, the class defines a function that picks a location for any element in O(1), which makes the contains(element) method much quicker, at the expense of taking up more space to decrease the chance of element location collisions.

Upvotes: 6

Rex Kerr
Rex Kerr

Reputation: 167891

The key line from the ListSet source is (within subclass Node):

override def +(e: A): ListSet[A] = if (contains(e)) this else new Node(e)

where you can see that an item is added only if it is not already contained. So adding to the set is O(n). You can generally assume that XMap has similar performance characteristics to XSet, and ListMap is listed as linear time all the way along. This is why, and it is how a set is supposed to behave.

P.S. In the TestHashSet case you're measuring startup time. It's way more than 30x faster.

Upvotes: 9

Related Questions