Reputation: 8281
I want to compose numbers in base 3, represented as fixed-length Array[Byte].
Here are some attempts:
val byteBoard = Array.fill(9)(1.toByte)
val cache: Seq[(Int, Int)] = (0 to 8).map(i => (i, math.pow(3d, i.toDouble).toInt))

@Benchmark
def composePow(): Unit = {
  val _ = (0 to 8).foldLeft(0) { case (acc, i) => acc + math.pow(3d, i.toDouble).toInt * byteBoard(i) }
}

@Benchmark
def composeCachedPowWithFold(): Unit = {
  val _ = cache.foldLeft(0) { case (acc, (i, k)) => acc + k * byteBoard(i).toInt }
}

@Benchmark
def composeCachedPowWithForeach(): Unit = {
  var acc = 0
  cache.foreach { case (i, k) => acc = acc + k * byteBoard(i) }
}

@Benchmark
def composeUnrolled(): Unit = {
  val _ = byteBoard(0) +
    3 * byteBoard(1) +
    3 * 3 * byteBoard(2) +
    3 * 3 * 3 * byteBoard(3) +
    3 * 3 * 3 * 3 * byteBoard(4) +
    3 * 3 * 3 * 3 * 3 * byteBoard(5) +
    3 * 3 * 3 * 3 * 3 * 3 * byteBoard(6) +
    3 * 3 * 3 * 3 * 3 * 3 * 3 * byteBoard(7) +
    3 * 3 * 3 * 3 * 3 * 3 * 3 * 3 * byteBoard(8)
}
Can you confirm the following conclusions?

1. composePow: boxing + type conversions to use math.pow
2. composeCachedPowWithFold: boxing, because fold is a parameterized method
3. composeCachedPowWithForeach: no boxing, but less idiomatic Scala (local mutation)
4. composeUnrolled: no boxing

And can you explain why 4. is way faster than 3.?
PS: Here are the results of the JMH benchmark:
[info] IndexBenchmark.composeCachedPowWithFold thrpt 10 7180844,823 ± 1015310,847 ops/s
[info] IndexBenchmark.composeCachedPowWithForeach thrpt 10 14234192,613 ± 1449571,042 ops/s
[info] IndexBenchmark.composePow thrpt 10 1515312,179 ± 34892,700 ops/s
[info] IndexBenchmark.composeUnrolled thrpt 10 152297653,110 ± 2237446,053 ops/s
Upvotes: 1
Views: 204
Reputation: 44957
I mostly agree with your analysis of cases 1, 2, and 4, but the third variant is really funny!
I agree with you about the first two versions: foldLeft is not @specialized, so, yes, there is some boxing and unboxing. But math.pow is evil for integer arithmetic anyway, and all those conversions to and from Double only incur additional overhead.
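As an aside, and not something from your benchmarks: if you only ever need the first nine powers of 3, you can precompute them with pure Int arithmetic and avoid math.pow and its Double round-trips entirely. A minimal sketch:

// Powers of 3 precomputed with Int arithmetic only; no math.pow involved.
// Yields 1, 3, 9, 27, 81, 243, 729, 2187, 6561.
val pow3: Array[Int] = Array.iterate(1, 9)(_ * 3)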
Now let's take a closer look at the third variant. It is so slow because you are constructing a closure over mutable state. Look at the output of scala -print; this is what your method is rewritten into:
private def composeCachedPowWithForeach(): Unit = {
  var acc: runtime.IntRef = scala.runtime.IntRef.create(0);
  anon$1.this.cache().foreach({
    ((x0$3: Tuple2) =>
      anon$1.this.$anonfun$composeCachedPowWithForeach$1(acc, x0$3))
  })
};
And here is the function used in foreach:
final <artifact> private[this] def $anonfun$composeCachedPowWithForeach$1(
    acc$1: runtime.IntRef, x0$3: Tuple2
): Unit = {
  case <synthetic> val x1: Tuple2 = x0$3;
  case4(){
    if (x1.ne(null))
      {
        val i: Int = x1._1$mcI$sp();
        val k: Int = x1._2$mcI$sp();
        matchEnd3({
          acc$1.elem = acc$1.elem.+(k.*(anon$1.this.byteBoard().apply(i)));
          scala.runtime.BoxedUnit.UNIT
        })
      }
    else
      case5()
  };
  case5(){
    matchEnd3(throw new MatchError(x1))
  };
  matchEnd3(x: scala.runtime.BoxedUnit){
    ()
  }
};
You can see that the pattern match apparently generates a lot of code; I'm not sure whether it contributes much to the overhead. What I personally find much more interesting is the runtime.IntRef part. This is an object that holds the mutable variable corresponding to var acc in your code. Even though it looks like a simple local variable, it must be referenced from the closure somehow, so it is wrapped into an object and evicted to the heap. I assume that accessing this mutable variable on the heap causes most of the overhead.
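If you want to see the effect of removing that heap reference, here is a variant you could add to the benchmark (my sketch, not from your code; the name composeCachedPowWithWhile and the Array[Int] cache pow3 from the sketch above are made up). A plain while loop needs no closure, so acc remains a genuine stack-local Int:

// Hypothetical benchmark: no closure, hence no IntRef and no Tuple2 traffic.
// Assumes pow3: Array[Int] holds the precomputed powers of 3 (see above).
@Benchmark
def composeCachedPowWithWhile(): Unit = {
  var acc = 0 // a real local Int; never wrapped in runtime.IntRef
  var i = 0
  while (i < 9) {
    acc += pow3(i) * byteBoard(i)
    i += 1
  }
  val _ = acc
}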
In contrast to that, if the byteBoard were passed as an argument, then nothing in the fourth variant would ever leave the function's stack frame:
private def composeUnrolled(): Unit = {
  val _: Int =
    anon$1.this.byteBoard().apply(0).+
      (3.*(anon$1.this.byteBoard().apply(1))).+
      (9.*(anon$1.this.byteBoard().apply(2))).+
      (27.*(anon$1.this.byteBoard().apply(3))).+
      (81.*(anon$1.this.byteBoard().apply(4))).+
      (243.*(anon$1.this.byteBoard().apply(5))).+
      (729.*(anon$1.this.byteBoard().apply(6))).+
      (2187.*(anon$1.this.byteBoard().apply(7))).+
      (6561.*(anon$1.this.byteBoard().apply(8)));
  ()
};
There is essentially no control flow to speak of and hardly any method invocations (the apply calls just access array elements, which doesn't count). Note that the compiler has already constant-folded the 3 * 3 * ... products into the literals 9, 27, ..., 6561, so what remains is a short run of simple arithmetic whose intermediate results might even fit into the registers of your processor. That's why it is so fast.
While you are at it, you might want to benchmark these two methods:
// Horner's rule with a while loop: digits are stored least-significant first,
// so we fold from the most significant digit (index n - 1) down to 0.
def ternaryToInt5(bytes: Array[Byte]): Int = {
  var acc = 0
  val n = bytes.size
  var i = n - 1
  while (i >= 0) {
    acc *= 3
    acc += bytes(i)
    i -= 1
  }
  acc
}

// The same Horner scheme written as a single nested expression.
def ternaryToInt6(bytes: Array[Byte]): Int = {
  bytes(0) +
    3 * (bytes(1) +
    3 * (bytes(2) +
    3 * (bytes(3) +
    3 * (bytes(4) +
    3 * (bytes(5) +
    3 * (bytes(6) +
    3 * (bytes(7) +
    3 * (bytes(8)))))))))
}
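As a quick sanity check (my snippet, not part of the benchmarks), both variants implement the same Horner scheme and should agree on any digit array:

// Illustrative check: the two Horner variants must produce the same value.
val digits = Array[Byte](1, 0, 2, 1, 0, 0, 2, 1, 1)
assert(ternaryToInt5(digits) == ternaryToInt6(digits))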
Also, if you are working with byte arrays frequently, you might find this syntactic sugar useful.
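For illustration only (a hypothetical sketch, not necessarily the sugar referred to above), an implicit value class can attach such a decoding method directly to Array[Byte]:

object ByteArraySyntax {
  // Hypothetical sugar: adds a base-3 decoding method to any Array[Byte].
  implicit class TernaryOps(val bytes: Array[Byte]) extends AnyVal {
    def ternaryToInt: Int = {
      var acc = 0
      var i = bytes.length - 1
      while (i >= 0) { acc = acc * 3 + bytes(i); i -= 1 }
      acc
    }
  }
}

// Usage: import ByteArraySyntax._ and then Array[Byte](1, 0, 2).ternaryToInt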
Upvotes: 1