Simon

Reputation: 431

Problems with Java Vector API to sum a list of doubles

I implemented three naive sum methods (in Scala) which act on an Array[Double] (i.e. a Java double[]).

    inline def sum2 =
      var sum: Double = 0.0
      var i: Int = 0
      inline val species = DoubleVector.SPECIES_256

      while i < species.loopBound(vec.length) do        
        val vec2: DoubleVector = DoubleVector.fromArray(species, vec, i)        
        sum = sum + vec2.reduceLanes(VectorOperators.ADD)
        i += species.length()
      end while      
      while i < vec.length do
        sum += vec(i)
        i += 1
      end while
      sum
    end sum2

    inline def sum: Double =
      var sum = 0.0
      var i = 0;
      while i < vec.length do
        sum = sum + vec(i)
        i = i + 1
      end while
      sum


    inline def sum3 =
      var i: Int = 0
      val species = DoubleVector.SPECIES_256

      var acc = DoubleVector.zero(species)

      while i < species.loopBound(vec.length) do
        val vec2: DoubleVector = DoubleVector.fromArray(species, vec, i)
        acc =
          acc.add(vec2) // This line, appears to break JMH. It's not clear why as it passes a unit test and compiles.
        i += species.length()
      end while

      var temp = acc.reduceLanes(VectorOperators.ADD)
      // var temp = 0.0
      while i < vec.length do
        temp += vec(i)
        i += 1
      end while
      temp
    end sum3

I then set up a benchmark to see if there was a difference.


@BenchmarkMode(Array(Mode.Throughput))
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(value = 1)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
abstract class BLASBenchmark:


  private final val rand: Random = new Random(0);

  protected def randomDouble(): Double =
    return rand.nextDouble();

  protected def randomDoubleArray(n: Int): Array[Double] =
    val res = new Array[Double](n);

    for i <- 0 until n do res(i) = rand.nextDouble();
    end for
    return res;
  end randomDoubleArray

  protected def randomFloat(): Float =
    return rand.nextFloat();

  protected def randomFloatArray(n: Int): Array[Float] =
    val res = new Array[Float](n);
    for i <- 0 until n do res(i) = rand.nextFloat();
    end for
    return res;
  end randomFloatArray
end BLASBenchmark

@State(Scope.Thread)
class SumBenchmark extends BLASBenchmark:

  @Param(Array("3", "10", "100", "10000", "100000"))
  var len: String = uninitialized;

  var arr: Array[Double] = uninitialized

  @Setup(Level.Trial)
  def setup: Unit =
    arr = randomDoubleArray(len.toInt);
    ()
  end setup

  @Benchmark
  def sum_loop(bh: Blackhole) =
    val r = arr.sum
    bh.consume(r);
  end sum_loop

  @Benchmark
  def sum_vec(bh: Blackhole) =
    val r = arr.sum2
    bh.consume(r);
  end sum_vec

  /** This doesn't work. I don't understand why. */
  @Benchmark
  def sum_vec_alt(bh: Blackhole) =
    val r = arr.sum3
    bh.consume(r);
  end sum_vec_alt

end SumBenchmark

The results were as follows (locally):

Benchmark               (len)   Mode  Cnt          Score           Error  Units
SumBenchmark.sum_loop       3  thrpt    3  650060202.104 ± 176059544.674  ops/s
SumBenchmark.sum_loop      10  thrpt    3  422309397.044 ±  19968584.370  ops/s
SumBenchmark.sum_loop     100  thrpt    3   30901788.878 ±    354896.502  ops/s
SumBenchmark.sum_loop   10000  thrpt    3     116283.523 ±      4388.696  ops/s
SumBenchmark.sum_loop  100000  thrpt    3      11487.085 ±       586.552  ops/s
SumBenchmark.sum_vec        3  thrpt    3  607547613.267 ±  25376258.940  ops/s
SumBenchmark.sum_vec       10  thrpt    3   48259596.149 ±   1205351.697  ops/s
SumBenchmark.sum_vec      100  thrpt    3    4604684.325 ±    122538.788  ops/s
SumBenchmark.sum_vec    10000  thrpt    3      45689.056 ±      1872.688  ops/s
SumBenchmark.sum_vec   100000  thrpt    3       4729.203 ±        83.013  ops/s

The way I'm reading this is that my vectorised version is, in fact, significantly slower than the simple while loop.

My questions:

  • Why is the vectorised version slower than the plain while loop?
  • Why does sum3 (the version that accumulates into a DoubleVector) appear to break the JMH benchmark, even though it compiles and passes a unit test?

Upvotes: 1

Views: 213

Answers (3)

JavaJ

Reputation: 11

As requested, here's the Java equivalent of Simon's enhanced solution. It is essentially the "Vector (1 accumulator)" algorithm from my previous post, but with Simon's changes applied - the DoubleVector is not assigned to a variable, a while loop is used instead of a for loop, etc. - to keep it as close as possible to Simon's version. It still runs the code 100,000 times to allow comparison of timings with the other algorithms.

// Vector accumulator (original post equivalent)
t1 = System.currentTimeMillis();
sum = 0;
for (int iLoop = 0; iLoop < iterations; iLoop++) {
    DoubleVector acc = DoubleVector.zero(dblSpecies);
    int i = 0;
    while (i < loopBound) {
        acc = acc.add(DoubleVector.fromArray(dblSpecies, arr, i));
        i += speciesLength;
    }
    sum += acc.reduceLanes(VectorOperators.ADD);
}
t2 = System.currentTimeMillis();
log.info("Vector (original post) took " + (t2 - t1) + " ms with result " + sum);

The result for this was 274ms, so basically the same as the original "Vector (1 accumulator)" code (within a very small margin of error of ~1.4%).

I don't believe that "not assigning" the DoubleVector to a variable in the code makes any difference... Internally, the result of that instruction is assigned to something - either explicitly in your code, or implicitly to a stack slot inside the JVM so that the result can be passed into the next method as a parameter. Remember, Java passes object references (by value), so there must be a reference internally to the result of the DoubleVector.fromArray() method invocation in order to pass it into the acc.add() method. It's not like the actual data is being copied in either case - just the reference to that data.
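
To make the comparison concrete, here are the two forms side by side as a minimal, self-contained sketch (the class and method names are mine, not from either benchmark); the only source-level difference is whether the intermediate vector gets a named local:

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

class AssignVsInline {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Form 1: the intermediate DoubleVector is assigned to a named local.
    static double sumNamed(double[] arr) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        for (int i = 0; i < SPECIES.loopBound(arr.length); i += SPECIES.length()) {
            DoubleVector dv = DoubleVector.fromArray(SPECIES, arr, i);
            acc = acc.add(dv);
        }
        return acc.reduceLanes(VectorOperators.ADD); // tail elements ignored for brevity
    }

    // Form 2: no named local; the reference to the intermediate vector simply
    // sits on the operand stack while add(...) is invoked.
    static double sumInline(double[] arr) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        for (int i = 0; i < SPECIES.loopBound(arr.length); i += SPECIES.length()) {
            acc = acc.add(DoubleVector.fromArray(SPECIES, arr, i));
        }
        return acc.reduceLanes(VectorOperators.ADD); // tail elements ignored for brevity
    }
}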

As a final note, there will be very little (if any) heap object allocation in this code. It looks like there is, because of all the DoubleVector.fromArray() calls inside the hot loop... these return a DoubleVector, which is an object, so you'd think it would allocate on the heap for that. But in reality, the JVM is smart enough to allocate data which doesn't "escape" the enclosing method on the stack, which is super-fast (compared to heap allocation). So in case anyone was wondering, the GC overhead for creating all these objects should be nearly negligible, since it should all be on the stack, not the heap.
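
If you want to sanity-check that claim on your own machine, one crude way (a sketch of mine, not part of the benchmark above) is to measure how many bytes the current thread allocates around the hot loop, using the com.sun.management extension of ThreadMXBean:

import com.sun.management.ThreadMXBean;
import java.lang.management.ManagementFactory;
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class AllocationCheck {
    public static void main(String[] args) {
        VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED;
        double[] arr = new double[8192];
        java.util.Arrays.fill(arr, 1.0);

        // HotSpot's ThreadMXBean can report per-thread allocation totals.
        ThreadMXBean bean = (ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().getId();

        double sum = 0;
        long before = bean.getThreadAllocatedBytes(tid);
        for (int iLoop = 0; iLoop < 100_000; iLoop++) {
            DoubleVector acc = DoubleVector.zero(species);
            for (int i = 0; i < species.loopBound(arr.length); i += species.length()) {
                acc = acc.add(DoubleVector.fromArray(species, arr, i));
            }
            sum += acc.reduceLanes(VectorOperators.ADD);
        }
        long after = bean.getThreadAllocatedBytes(tid);

        // If scalar replacement kicks in after warm-up, the delta stays small
        // relative to the millions of DoubleVector objects nominally created.
        System.out.println("sum = " + sum + ", ~" + (after - before) + " bytes allocated");
    }
}

Expect some allocation during warm-up (class loading, JIT compilation), so the number is only indicative.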

Upvotes: 0

JavaJ

Reputation: 11

I've done similar testing in Java (JDK 22), also running on a Mac (M3 Max). In my test, I have an array of 8192 doubles, and I want to calculate the sum. The fastest way I found to do it is to combine the Vector API with using separate accumulators. I haven't benchmarked with JMH, because I'm not familiar with it - I've just timed it running 100,000 times over the array. So the timings from my tests are for summing a total of 819,200,000 doubles for each algorithm tested.

// Requires: import jdk.incubator.vector.DoubleVector;
//           import jdk.incubator.vector.VectorOperators;
//           import jdk.incubator.vector.VectorSpecies;
public void test() {
    // Set up array of doubles...
    int num = 8192;
    double[] arr = new double[num];
    for (int i = 0; i < num; i++) {
        arr[i] = Math.random();
    }

    // Calculate the sum using various methods
    final int iterations = 100_000;
    long t1, t2;
    double sum = 0;

    // Linear sum with single accumulator
    t1 = System.currentTimeMillis();
    sum = 0;
    for (int iLoop = 0; iLoop < iterations; iLoop++) {
        for (int i = 0; i < num; i++) {
            sum += arr[i];
        }
    }
    t2 = System.currentTimeMillis();
    log.info("Linear (1 accumulator) took " + (t2 - t1) + " ms with result " + sum);

    // Linear sum with 4 accumulators
    t1 = System.currentTimeMillis();
    sum = 0;
    double sum1 = 0;
    double sum2 = 0;
    double sum3 = 0;
    double sum4 = 0;
    for (int iLoop = 0; iLoop < iterations; iLoop++) {
        for (int i = 0; i < num; i += 4) {
            sum1 += arr[i];
            sum2 += arr[i + 1];
            sum3 += arr[i + 2];
            sum4 += arr[i + 3];
        }
    }
    sum = sum1 + sum2 + sum3 + sum4;
    t2 = System.currentTimeMillis();
    log.info("Linear (4 accumulators) took " + (t2 - t1) + " ms with result " + sum);


    VectorSpecies<Double> dblSpecies = DoubleVector.SPECIES_PREFERRED;
    int speciesLength = dblSpecies.length();
    int loopBound = dblSpecies.loopBound(num);

    // Vector ReduceLanes
    t1 = System.currentTimeMillis();
    sum = 0;
    for (int iLoop = 0; iLoop < iterations; iLoop++) {
        for (int i = 0; i < loopBound; i += speciesLength) {
            DoubleVector dv = DoubleVector.fromArray(dblSpecies, arr, i);
            sum += dv.reduceLanes(VectorOperators.ADD);
        }
    }
    t2 = System.currentTimeMillis();
    log.info("Vector (reduceLanes) took " + (t2 - t1) + " ms with result " + sum);

    // Vector Accumulator
    t1 = System.currentTimeMillis();
    sum = 0;
    for (int iLoop = 0; iLoop < iterations; iLoop++) {
        DoubleVector acc = DoubleVector.zero(dblSpecies);
        for (int i = 0; i < loopBound; i += speciesLength) {
            DoubleVector dv = DoubleVector.fromArray(dblSpecies, arr, i);
            acc = acc.add(dv);
        }
        sum += acc.reduceLanes(VectorOperators.ADD);
    }
    t2 = System.currentTimeMillis();
    log.info("Vector (1 accumulator) took " + (t2 - t1) + " ms with result " + sum);


    // Hybrid - Vector accumulator with multiple linear accumulators
    t1 = System.currentTimeMillis();
    sum = 0;
    for (int iLoop = 0; iLoop < iterations; iLoop++) {
        // Break the index range into 4 equal sections...
        int partLen = num / 4;
        int s1 = 0;
        int s2 = partLen;
        int s3 = 2 * partLen;
        int s4 = 3 * partLen;
        int partLoopBound = dblSpecies.loopBound(partLen);
        DoubleVector acc1 = DoubleVector.zero(dblSpecies);
        DoubleVector acc2 = DoubleVector.zero(dblSpecies);
        DoubleVector acc3 = DoubleVector.zero(dblSpecies);
        DoubleVector acc4 = DoubleVector.zero(dblSpecies);
        for (int i = 0; i < partLoopBound; i += speciesLength) {
            DoubleVector dv1 = DoubleVector.fromArray(dblSpecies, arr, i + s1);
            acc1 = acc1.add(dv1);
            DoubleVector dv2 = DoubleVector.fromArray(dblSpecies, arr, i + s2);
            acc2 = acc2.add(dv2);
            DoubleVector dv3 = DoubleVector.fromArray(dblSpecies, arr, i + s3);
            acc3 = acc3.add(dv3);
            DoubleVector dv4 = DoubleVector.fromArray(dblSpecies, arr, i + s4);
            acc4 = acc4.add(dv4);
        }
        sum += acc1.reduceLanes(VectorOperators.ADD)
                + acc2.reduceLanes(VectorOperators.ADD)
                + acc3.reduceLanes(VectorOperators.ADD)
                + acc4.reduceLanes(VectorOperators.ADD);
    }
    t2 = System.currentTimeMillis();
    log.info("Vector (4 accumulators) took " + (t2 - t1) + " ms with result " + sum);
}

The first time this test method is called, I get results:

Linear (1 accumulator) took 573 ms with result 4.098820819868104E8
Linear (4 accumulators) took 150 ms with result 4.0988208200299764E8
Vector (reduceLanes) took 290 ms with result 4.098820819863893E8
Vector (1 accumulator) took 1691 ms with result 4.0988208200494045E8
Vector (4 accumulators) took 1174 ms with result 4.0988208200494045E8

By the 4th or 5th time I call it, I assume the JVM has JIT-compiled the hot loops, and I get more sensible timings for the last couple of Vector API algorithms:

Linear (1 accumulator) took 605 ms with result 4.101329885212625E8
Linear (4 accumulators) took 142 ms with result 4.10132988510815E8
Vector (reduceLanes) took 290 ms with result 4.1013298850172895E8
Vector (1 accumulator) took 279 ms with result 4.1013298851254654E8
Vector (4 accumulators) took 71 ms with result 4.1013298851254654E8

My takeaways from this test are:

  • The JVM seems to optimise the code to make best use of SIMD instructions after the code has been invoked lots of times. This is particularly evident when using Vector API add operations. I'm curious to know why, given the Vector API is designed to optimise the byte code from the initial compilation (e.g. it has @ForceInline all over the place).
  • Using a linear sum with multiple accumulators is faster than a simple Vector API implementation (at least on Apple silicon where the species length is limited to 128 bits, so it can only store 2 doubles).
  • For the Vector API, using an accumulator is slightly faster than reduceLanes (i.e. it takes ~96% of the time). This makes sense to me, because even with "only" 128 bits, my understanding is that it's adding 2 doubles simultaneously. Whereas reduceLanes with only 2 values is effectively adding 1 double to another in sequence. It's faster than a linear sum with a single accumulator (46%)... but it's still slower than doing a linear sum using 4 accumulators (196%)
  • The fastest way to sum a double array is to combine these techniques into a hybrid approach, where we use the Vector API to accumulate the values, but also use multiple accumulators (12% compared to linear with 1 accumulator, and 50% compared to linear with the same number (4) of accumulators).

I haven't been able to run this test on any other hardware, but my working assumption is that for CPUs with a larger "species size" (e.g. 256 or 512 bit vectors) the gain by using the Vector API would be greater than I found in my results. But even with 128 bit vectors, it's still worth using the Vector API - as long as you combine it with the multiple accumulators technique. This can give you nearly a 10x speedup compared to a naive linear sum.
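
If you want to see what species size your own hardware gives you before choosing an approach, a few lines like these (a sketch; run with --add-modules jdk.incubator.vector) will print it:

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

public class PreferredSpecies {
    public static void main(String[] args) {
        VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED;
        // e.g. 128 bits / 2 doubles on Apple silicon; 256 or 512 bits on many x86 CPUs
        System.out.println(species.vectorBitSize() + " bits, "
                + species.length() + " doubles per vector");
    }
}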

A couple of other final notes:

  • The code above doesn't deal with trailing values where the array length isn't exactly divisible by the species size (or the number of accumulators). I expect that adding this would have a negligible impact on the speed. This is why I chose an array size of 8192: it's divisible by the species length (2) and then again by the number of accumulators (4), so I'd still get accurate results. (A sketch of the usual tail handling follows these notes.)
  • Although my test code takes things like the species length out of the hot loop, I couldn't replicate your results in this area... putting it inside the hot loop makes little/no difference for me. I suspect that the JVM optimises this stuff anyway, or at least it does after it's run several times. I'm running each algorithm 100,000 times for each test, which may explain why I didn't see any noticeable difference.
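
For completeness, here's roughly what handling the tail would look like (a sketch of mine, mirroring the scalar remainder loop in Simon's Scala code rather than anything from my benchmark above):

import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorSumWithTail {
    static double sum(double[] arr) {
        VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED;
        DoubleVector acc = DoubleVector.zero(species);
        int loopBound = species.loopBound(arr.length);
        int i = 0;
        while (i < loopBound) { // full vectors
            acc = acc.add(DoubleVector.fromArray(species, arr, i));
            i += species.length();
        }
        double total = acc.reduceLanes(VectorOperators.ADD);
        while (i < arr.length) { // scalar tail
            total += arr[i];
            i++;
        }
        return total;
    }
}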

In terms of throughput, the "best" algorithm here is summing 819,200,000 doubles in 71ms, which is ~11.5 billion values per second (compared to ~1.4 billion per second using a naive sum). Your mileage will probably vary significantly, depending on your CPU.

Upvotes: 1

Simon

Reputation: 431

I believe the algorithm I used was indeed naive.

    inline def sum: Double =
      var i: Int = 0

      var acc = DoubleVector.zero(Matrix.doubleSpecies)
      val sp = Matrix.doubleSpecies
      val l = sp.length()

      while i < sp.loopBound(vec.length) do
        acc = acc.add(DoubleVector.fromArray(Matrix.doubleSpecies, vec, i))
        i += l
      end while
      var temp = acc.reduceLanes(VectorOperators.ADD)
      // var temp = 0.0
      while i < vec.length do
        temp += vec(i)
        i += 1
      end while
      temp
    end sum

With the key changes being:

  • remove the length method and other calls into the Species from the hot loop
  • don't assign the loaded vector to an intermediate value - add it straight into the accumulator
  • accumulate along lanes, then reduce once at the end
  • select the species as "species preferred"

This provides a more satisfactory result.

SumBenchmark.sum_loop                  3     N/A  thrpt    3  454987372.518 ±  4974554.176  ops/s
SumBenchmark.sum_loop                100     N/A  thrpt    3   22881036.167 ±   429259.728  ops/s
SumBenchmark.sum_loop             100000     N/A  thrpt    3      10733.807 ±       65.014  ops/s
SumBenchmark.sum_vec                   3     N/A  thrpt    3  454811646.671 ± 12166349.908  ops/s
SumBenchmark.sum_vec                 100     N/A  thrpt    3   40353921.106 ±   547651.178  ops/s
SumBenchmark.sum_vec              100000     N/A  thrpt    3      40141.621 ±      714.686  ops/s
SumBenchmark.sum_vec_alt               3     N/A  thrpt    3  505591602.643 ± 36682920.662  ops/s
SumBenchmark.sum_vec_alt             100     N/A  thrpt    3  110847968.173 ±  1363183.523  ops/s
SumBenchmark.sum_vec_alt          100000     N/A  thrpt    3      42821.901 ±      244.293  ops/s

For very large vectors, that's a speedup of roughly 4x.

Upvotes: 2
