Yann Moisan

Reputation: 8281

How to avoid boxing when composing numbers in base 3 in Scala

I want to compose numbers in base 3, represented as fixed-length Array[Byte].

Here are some attempts:

val byteBoard = Array.fill(9)(1.toByte)

val cache: Seq[(Int, Int)] = (0 to 8).map(i => (i, math.pow(3d, i.toDouble).toInt))

@Benchmark
def composePow(): Unit = {
  val _ = (0 to 8).foldLeft(0) { case (acc, i) => acc + math.pow(3d, i.toDouble).toInt * byteBoard(i) }
}

@Benchmark
def composeCachedPowWithFold(): Unit = {
  val _ = cache.foldLeft(0) { case (acc, (i, k)) => acc + k * byteBoard(i).toInt }
}

@Benchmark
def composeCachedPowWithForeach(): Unit = {
  var acc = 0
  cache.foreach { case (i, k) => acc = acc + k * byteBoard(i)}
}

@Benchmark
def composeUnrolled(): Unit = {
  val _ = byteBoard(0) +
    3 * byteBoard(1) +
    3 * 3 * byteBoard(2) +
    3 * 3 * 3 * byteBoard(3) +
    3 * 3 * 3 * 3 * byteBoard(4) +
    3 * 3 * 3 * 3 * 3 * byteBoard(5) +
    3 * 3 * 3 * 3 * 3 * 3 * byteBoard(6) +
    3 * 3 * 3 * 3 * 3 * 3 * 3 * byteBoard(7) +
    3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * byteBoard(8)
}

Can you confirm the following conclusion:

And explain why 4 is so much faster than 3?

PS: Here are the results of the JMH benchmark:

[info] IndexBenchmark.composeCachedPowWithFold                    thrpt   10    7180844,823 ± 1015310,847  ops/s
[info] IndexBenchmark.composeCachedPowWithForeach                 thrpt   10   14234192,613 ± 1449571,042  ops/s
[info] IndexBenchmark.composePow                                  thrpt   10    1515312,179 ±   34892,700  ops/s
[info] IndexBenchmark.composeUnrolled                             thrpt   10  152297653,110 ± 2237446,053  ops/s

Upvotes: 1

Views: 204

Answers (1)

Andrey Tyukin

Reputation: 44957

I mostly agree with your analysis of cases 1, 2, and 4, but the third variant is really funny!

I agree with you about the first two versions: foldLeft is not @specialized, so, yes, there is some boxing and unboxing. But math.pow is evil for integer arithmetic anyway, and all those Double conversions only add overhead.
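As an aside, the Double round-trip of math.pow can be avoided entirely with plain integer multiplication. A minimal sketch (the name pow3 is my assumption, not from the post):

```scala
// Sketch: an integer power of 3, avoiding math.pow's Double round-trip.
// The helper name pow3 is assumed here, not taken from the original post.
def pow3(exp: Int): Int = {
  var result = 1
  var i = 0
  while (i < exp) {
    result *= 3
    i += 1
  }
  result
}
```

This stays in Int arithmetic throughout, so there is nothing to box and nothing to convert.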

Now let's take a closer look at the third variant. It is so slow because you are constructing a closure over mutable state. Look at the output of scala -print; this is what your method is rewritten into:

private def composeCachedPowWithForeach(): Unit = {
  var acc: runtime.IntRef = scala.runtime.IntRef.create(0);
  anon$1.this.cache().foreach({
    ((x0$3: Tuple2) => 
      anon$1.this.
        $anonfun$composeCachedPowWithForeach$1(acc, x0$3))
  })
};

And here is the function used in foreach:

final <artifact> private[this] def 
$anonfun$composeCachedPowWithForeach$1(
   acc$1: runtime.IntRef, x0$3: Tuple2
): Unit = {
  case <synthetic> val x1: Tuple2 = x0$3;
  case4(){
    if (x1.ne(null))
      {
        val i: Int = x1._1$mcI$sp();
        val k: Int = x1._2$mcI$sp();
        matchEnd3({
          acc$1.elem = acc$1.elem.+(k.*(anon$1.this.byteBoard().apply(i)));
          scala.runtime.BoxedUnit.UNIT
        })
      }
    else
      case5()
  };
  case5(){
    matchEnd3(throw new MatchError(x1))
  };
  matchEnd3(x: scala.runtime.BoxedUnit){
    ()
  }
};

You can see that the pattern matching apparently generates a lot of code. I'm not sure whether it contributes much to the overhead. What I personally find much more interesting is the runtime.IntRef part. This is an object that holds the mutable variable corresponding to var acc in your code. Even though acc looks like a simple local variable, it must somehow be referenced from the closure, so it is wrapped into an object and evicted to the heap. I assume that accessing this mutable variable on the heap causes most of the overhead.
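One way around the IntRef, sketched here as a hypothetical rewrite (not part of the original answer): iterate with a plain while loop, so the accumulator stays a local Int that never escapes into a closure.

```scala
// Sketch (hypothetical rewrite, not from the original answer): no closure,
// so acc remains a plain local Int on the stack -- no IntRef on the heap,
// no Tuple2 unpacking, no pattern-match scaffolding.
def composeWhileLoop(byteBoard: Array[Byte]): Int = {
  var acc = 0
  var k = 1 // running power of 3
  var i = 0
  while (i < 9) {
    acc += k * byteBoard(i)
    k *= 3
    i += 1
  }
  acc
}
```

This also makes the cached Seq[(Int, Int)] unnecessary, since the power of 3 is carried along as a running product.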


In contrast, if the byteBoard were passed as an argument, nothing in the fourth variant would ever leave the function's stack frame:

private def composeUnrolled(): Unit = {
  val _: Int = 
    anon$1.this.byteBoard().apply(0).+  
    (3.*(anon$1.this.byteBoard().apply(1))).+
    (9.*(anon$1.this.byteBoard().apply(2))).+
    (27.*(anon$1.this.byteBoard().apply(3))).+
    (81.*(anon$1.this.byteBoard().apply(4))).+
    (243.*(anon$1.this.byteBoard().apply(5))).+
    (729.*(anon$1.this.byteBoard().apply(6))).+
    (2187.*(anon$1.this.byteBoard().apply(7))).+
    (6561.*(anon$1.this.byteBoard().apply(8)));
  ()
};

There is essentially no control flow to speak of, and hardly any method invocations (the apply calls just access array elements, so they don't count); overall it's one very simple arithmetic expression that might even fit entirely into your processor's registers. That's why it is so fast.


While you are at it, you might want to benchmark these two methods:

def ternaryToInt5(bytes: Array[Byte]): Int = {
  var acc = 0
  val n = bytes.length
  var i = n - 1
  while (i >= 0) {
    acc *= 3
    acc += bytes(i)
    i -= 1
  }
  acc
}

def ternaryToInt6(bytes: Array[Byte]): Int = {
  bytes(0) + 
  3 * (bytes(1) + 
  3 * (bytes(2) + 
  3 * (bytes(3) + 
  3 * (bytes(4) + 
  3 * (bytes(5) + 
  3 * (bytes(6) + 
  3 * (bytes(7) + 
  3 * (bytes(8)))))))))
}
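Both variants treat bytes(0) as the least significant digit and are just Horner's rule in two shapes. A quick equivalence check on a sample digit array (the digits below are my example, not from the answer):

```scala
// Horner's rule vs. the explicit positional sum, on a sample digit array.
// The digit values are an assumed example, not taken from the original answer.
val digits: Array[Byte] = Array(2, 1, 0, 2, 1, 0, 2, 1, 0)

// Horner form, digits(0) least significant -- same convention as above.
val horner = digits.foldRight(0)((d, acc) => acc * 3 + d)

// Explicit sum of digit * 3^position for comparison.
val positional = digits.zipWithIndex.map { case (d, i) =>
  d * math.pow(3, i).toInt
}.sum
```

The two expressions always agree; the Horner form just avoids computing any power explicitly.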

Also, if you are working with byte arrays frequently, you might find this syntactic sugar useful.

Upvotes: 1
