fbl

Reputation: 2920

Scala parallel collection runtime puzzling

Edit: My sample size was too small. When I ran it against the real data on 8 CPU's, I saw a 7.2x speed increase. Not too shabby for adding 4 characters to my code ;)

I am currently in the process of trying to 'sell' management on the benefits of using Scala, especially when it comes to scaling with CPUs. To that end, I created a simple test application that does a bunch of vector math, and I was a bit surprised to find that the runtime was not noticeably better on my quad-core machine. Interestingly, I found that the runtime is worst the first time you go through the collection and gets better with subsequent calls. Is there something lazy in the parallel collections that causes this, or am I just doing this wrong? It should be noted that I come from the C++/C# world, so it's entirely possible that I have messed up my configuration somehow. Regardless, here's my setup:

IntelliJ Scala Plugin

Scala 2.9.1.final

Windows 7 64 bit, Quad-Core Processor (no hyperthreading)

import util.Random

  // simple Vector3D class with immutable x, y, z components, a length, and a '-' function
  class Vector3D(val x:Double,  val y:Double, val z:Double)
  {
    def length = math.sqrt(x*x+y*y+z*z)
    def -(rhs : Vector3D ) = new Vector3D(x - rhs.x, y - rhs.y, z - rhs.z)
  }

object MainClass {

  def main(args : Array[String]) =
  {
    println("Available CPU's: " + Runtime.getRuntime.availableProcessors())
    println("Parallelism Degree set to: " + collection.parallel.ForkJoinTasks.defaultForkJoinPool.getParallelism)
    // my position
    val myPos = new Vector3D(0, 0, 0)

    val r = new Random(0)

    // define a function nextRand that gets us a random between 0 and 100
    def nextRand = r.nextDouble() * 100

    // make 10 million random targets
    val targets = (0 until 10000000).map(_ => new Vector3D(nextRand, nextRand, nextRand)).toArray
    // take the .par hit before we start profiling
    val parTargets = targets.par

    println("Created " + targets.length + " vectors")

    // define a range function
    val rangeFunc : (Vector3D => Double) = (targetPos) => (targetPos - myPos).length

    // we'll select ones that are <50
    val within50 : (Vector3D => Boolean) = (targetPos) => rangeFunc(targetPos) < 50

    // time it sequentially
    val startTime_sequential = System.currentTimeMillis()
    val numTargetsInRange_sequential = targets.filter(within50)
    val endTime_sequential = System.currentTimeMillis()
    println("Sequential (ms): " + (endTime_sequential - startTime_sequential))

    // do the parallel version 10 times
    for(i <- 1 to 10)
    {

      val startTime_par = System.currentTimeMillis()
      val numTargetsInRange_parallel = parTargets.filter(within50)
      val endTime_par = System.currentTimeMillis()

      val ms = endTime_par - startTime_par
      println("Iteration[" + i + "] Executed in " + ms + " ms")
    }
  }
}

The output of this program is:

Available CPU's: 4
Parallelism Degree set to: 4
Created 10000000 vectors
Sequential (ms): 216
Iteration[1] Executed in 227 ms
Iteration[2] Executed in 253 ms
Iteration[3] Executed in 76 ms
Iteration[4] Executed in 78 ms
Iteration[5] Executed in 77 ms
Iteration[6] Executed in 80 ms
Iteration[7] Executed in 78 ms
Iteration[8] Executed in 78 ms
Iteration[9] Executed in 79 ms
Iteration[10] Executed in 82 ms

So what's going on here? The first two times we do the filter it's slower, and then things speed up? I understand that there will inherently be some parallelism startup cost; I'm just trying to figure out where it makes sense to express the parallelism in my application. Specifically, I want to be able to show management a program that runs 3-4 times faster on a quad-core box. Is this just not a good problem for that?

Ideas?

Upvotes: 9

Views: 979

Answers (4)

Jed Wesley-Smith

Reputation: 4706

You have the micro-benchmark disease: you are most likely benchmarking the JIT compilation phase. You'll need to warm up the JIT with a pre-run first.

Probably the best idea is to use a micro-benchmarking framework like http://code.google.com/p/caliper/ which handles all that for you.
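If pulling in a framework isn't an option, the warm-up can also be done by hand. Here's a minimal sketch (the names `WarmupSketch` and `time` are mine, not from Caliper): run the workload a few times untimed so the JIT compiles the hot paths, then measure.

```scala
object WarmupSketch {
  // time a block, returning its result and elapsed milliseconds
  def time[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000)
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 1000000).toArray

    // warm-up runs: execute the workload so the JIT compiles it, discard timings
    for (_ <- 1 to 5) data.filter(_ % 2 == 0)

    // measured run: by now the hot paths should be compiled
    val (evens, ms) = time(data.filter(_ % 2 == 0))
    println("Filtered " + evens.length + " elements in " + ms + " ms")
  }
}
```

A framework like Caliper essentially automates this, plus repetition and statistics, which is why it is the safer choice.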

Edit: There is a nice SBT Template for Caliper benchmarking Scala projects, as referenced from this blog post

Upvotes: 11

maasg

Reputation: 37435

Next to the JIT optimizations mentioned before, a key question you need to evaluate is whether your problem lends itself to parallelization: there is an inherent cost of splitting, thread coordination, and joining that weighs against the advantage of doing things in parallel. Scala hides this complexity from you, but you do need to know when to apply it to get good results.

In your case, although you're performing a huge number of operations, each operation on its own is almost trivial for the CPU. To see the parallel collections in action, try an operation that is heavy on a per-unit basis.

For a similar Scala presentation, I used a simple (inefficient) algorithm to determine whether a number is prime: def isPrime(x:Int) = (2 to x/2).forall(y=>x%y!=0)

Then use the same logic you presented to determine the numbers in a collection that are prime:

val col = 1 to 1000000
col.filter(isPrime(_))  // sequential
col.par.filter(isPrime(_)) // parallel

The CPU behaviour really showed the difference between the two (see the CPU usage chart: prime numbers, sequential vs parallel).

The time was about 3.5x better for the parallel collections in a 4-core CPU.
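Putting it together, a self-contained sketch of this experiment might look like the following (the object name `PrimeBench` and the `timeMs` helper are mine; this assumes a 2.9-era Scala where `.par` is available on a `Range` out of the box, and uses a smaller range than above just to keep the runtime short):

```scala
object PrimeBench {
  // deliberately inefficient primality test, heavy per element
  def isPrime(x: Int): Boolean = (2 to x / 2).forall(y => x % y != 0)

  // time a block in milliseconds (no warm-up handling here)
  def timeMs[A](body: => A): Long = {
    val start = System.currentTimeMillis()
    body
    System.currentTimeMillis() - start
  }

  def main(args: Array[String]): Unit = {
    val col = 1 to 200000

    // warm-up pass for the JIT, as discussed in the other answers
    col.filter(isPrime(_))
    col.par.filter(isPrime(_))

    val seqMs = timeMs(col.filter(isPrime(_)))
    val parMs = timeMs(col.par.filter(isPrime(_)))
    println("sequential: " + seqMs + " ms, parallel: " + parMs + " ms")
  }
}
```

Because each `isPrime` call does real work, the per-element cost dwarfs the split/join overhead, which is what lets the speedup approach the core count.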

Upvotes: 3

huynhjl

Reputation: 41646

Things do speed up, but that has nothing to do with parallel versus sequential; you are not comparing apples to apples. The JVM has a JIT (just-in-time) compiler that compiles some bytecode only after that code has been used a certain number of times. So what you see in the first iterations is slower execution of code that is not yet JIT-ed, plus the time for the ongoing JIT compilation itself. Removing the .par so that it's all sequential, here is what I see on my machine (with a 10x smaller collection, because I'm using an older machine):

Sequential (ms): 312
Iteration[1] Executed in 117 ms
Iteration[2] Executed in 112 ms
Iteration[3] Executed in 112 ms
Iteration[4] Executed in 112 ms
Iteration[5] Executed in 114 ms
Iteration[6] Executed in 113 ms
Iteration[7] Executed in 113 ms
Iteration[8] Executed in 117 ms
Iteration[9] Executed in 113 ms
Iteration[10] Executed in 111 ms

But it's all sequential! You can see what the JVM does in terms of JIT compilation by using the JVM flag -XX:+PrintCompilation (set it in JAVA_OPTS, or use the -J-XX:+PrintCompilation scala option). In the first iterations you will see a large number of JVM print statements showing what's being JIT-ed; then it stabilizes later on.

So to compare apples to apples, first run the program without .par, then add .par and run the same program. On my dual core, when using .par I get:

Sequential (ms): 329
Iteration[1] Executed in 197 ms
Iteration[2] Executed in 60 ms
Iteration[3] Executed in 57 ms
Iteration[4] Executed in 58 ms
Iteration[5] Executed in 59 ms
Iteration[6] Executed in 73 ms
Iteration[7] Executed in 56 ms
Iteration[8] Executed in 60 ms
Iteration[9] Executed in 58 ms
Iteration[10] Executed in 57 ms

So more or less a 2x speedup once it's stable.

On a related note, the other thing you want to be careful with is boxing and unboxing, especially if you are comparing to plain Java. The Scala library's higher-order functions like filter box and unbox primitive types, and this is typically a source of initial disappointment for those who convert code from Java to Scala.
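To make the boxing point concrete, here is an illustrative sketch (the object name is mine): `filter` on an `Array[Double]` passes each element through a generic predicate, which can box it into a `java.lang.Double` on the way, while a hand-written `while` loop stays on primitives. Both produce the same answer; only the per-element allocation behaviour differs.

```scala
object BoxingSketch {
  def main(args: Array[String]): Unit = {
    val xs = Array.tabulate(1000000)(_.toDouble)

    // boxed path: elements may be wrapped in java.lang.Double
    // on their way through the generic predicate
    val boxedCount = xs.filter(_ < 500000.0).length

    // primitive path: plain while loop, no per-element allocation
    var i = 0
    var count = 0
    while (i < xs.length) {
      if (xs(i) < 500000.0) count += 1
      i += 1
    }

    println(boxedCount + " == " + count)  // both are 500000
  }
}
```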

Though it does not apply in this case, since the for loop is outside the timing, there is also some cost to using for instead of while; the 2.9.1 compiler should do a decent job here when the -optimize scalac flag is used.

Upvotes: 7

Kim Stebel

Reputation: 42045

how about

val numTargetsInRange_sequential = parTargets.filter(within50)

?

Also, you will probably get more impressive results with a map rather than a filter operation.
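For what it's worth, a quick sketch of why map tends to demo better (the object name is mine): a map produces exactly one output per input, so the parallel runtime can write results into pre-sized chunks, whereas a filter produces chunks of data-dependent size that have to be merged afterwards.

```scala
object MapDemo {
  def main(args: Array[String]): Unit = {
    val values = Array.tabulate(1000000)(_.toDouble)

    // map: one output per input, so combining parallel chunks is cheap
    val roots = values.par.map(math.sqrt)

    // filter: chunk sizes depend on the data, so results must be merged
    val small = values.par.filter(_ < 1000.0)

    println(roots.length + " " + small.length)  // 1000000 and 1000
  }
}
```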

Upvotes: 0
