Reputation: 431
I implemented three naive sum methods (in scala) which act on an Array[Double] (or double[])
inline def sum2 =
var sum: Double = 0.0
var i: Int = 0
inline val species = DoubleVector.SPECIES_256
while i < species.loopBound(vec.length) do
val vec2: DoubleVector = DoubleVector.fromArray(species, vec, i)
sum = sum + vec2.reduceLanes(VectorOperators.ADD)
i += species.length()
end while
while i < vec.length do
sum += vec(i)
i += 1
end while
sum
end sum2
inline def sum: Double =
var sum = 0.0
var i = 0;
while i < vec.length do
sum = sum + vec(i)
i = i + 1
end while
sum
inline def sum3 =
var i: Int = 0
val species = DoubleVector.SPECIES_256
var acc = DoubleVector.zero(species)
while i < species.loopBound(vec.length) do
val vec2: DoubleVector = DoubleVector.fromArray(species, vec, i)
acc =
acc.add(vec2) // This line, appears to break JMH. It's not clear why as it passes a unit test and compiles.
i += species.length()
end while
var temp = acc.reduceLanes(VectorOperators.ADD)
// var temp = 0.0
while i < vec.length do
temp += vec(i)
i += 1
end while
temp
end sum3
Then setup a benchmark to see if there was a difference.
@BenchmarkMode(Array(Mode.Throughput))
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
@Fork(value = 1)
@Warmup(iterations = 3)
@Measurement(iterations = 3)
abstract class BLASBenchmark:
private final val rand: Random = new Random(0);
protected def randomDouble(): Double =
return rand.nextDouble();
protected def randomDoubleArray(n: Int): Array[Double] =
val res = new Array[Double](n);
for i <- 0 until n do res(i) = rand.nextDouble();
end for
return res;
end randomDoubleArray
protected def randomFloat(): Float =
return rand.nextFloat();
protected def randomFloatArray(n: Int): Array[Float] =
val res = new Array[Float](n);
for i <- 0 until n do res(i) = rand.nextFloat();
end for
return res;
end randomFloatArray
end BLASBenchmark
@State(Scope.Thread)
class SumBenchmark extends BLASBenchmark:
@Param(Array("3", "10", "100", "10000", "100000"))
var len: String = uninitialized;
var arr: Array[Double] = uninitialized
@Setup(Level.Trial)
def setup: Unit =
arr = randomDoubleArray(len.toInt);
()
end setup
@Benchmark
def sum_loop(bh: Blackhole) =
val r = arr.sum
bh.consume(r);
end sum_loop
@Benchmark
def sum_vec(bh: Blackhole) =
val r = arr.sum2
bh.consume(r);
end sum_vec
/**
This doesn't work. I don't understand why.
*/
@Benchmark
def sum_vec_alt(bh: Blackhole) =
val r = arr.sum3
bh.consume(r);
end sum_vec_alt
end SumBenchmark
The result of which were as follows (locally)
Benchmark (len) Mode Cnt Score Error Units
SumBenchmark.sum_loop 3 thrpt 3 650060202.104 ± 176059544.674 ops/s
SumBenchmark.sum_loop 10 thrpt 3 422309397.044 ± 19968584.370 ops/s
SumBenchmark.sum_loop 100 thrpt 3 30901788.878 ± 354896.502 ops/s
SumBenchmark.sum_loop 10000 thrpt 3 116283.523 ± 4388.696 ops/s
SumBenchmark.sum_loop 100000 thrpt 3 11487.085 ± 586.552 ops/s
SumBenchmark.sum_vec 3 thrpt 3 607547613.267 ± 25376258.940 ops/s
SumBenchmark.sum_vec 10 thrpt 3 48259596.149 ± 1205351.697 ops/s
SumBenchmark.sum_vec 100 thrpt 3 4604684.325 ± 122538.788 ops/s
SumBenchmark.sum_vec 10000 thrpt 3 45689.056 ± 1872.688 ops/s
SumBenchmark.sum_vec 100000 thrpt 3 4729.203 ± 83.013 ops/s
The way I'm reading this, is that my vectorised version is in fact, significantly slower that the simple while loop.
My questions;
Upvotes: 1
Views: 213
Reputation: 11
As requested, here's the Java equivalent of Simon's enhanced solution. This is basically the equivalent of the "Vector (1 accumulator)" algorithm in my previous post, but includes Simon's changes to not assign the DoubleVector to a variable, uses a while loop instead of a for loop, etc. to make it as close as possible to Simon's version. It still runs the code 100,000 times to allow comparison of timings with the other algorithms.
// Vector accumulator (original post equivalent)
t1 = System.currentTimeMillis();
sum = 0;
for (int iLoop = 0; iLoop < iterations; iLoop++) {
DoubleVector acc = DoubleVector.zero(dblSpecies);
int i = 0;
while (i < loopBound) {
acc = acc.add(DoubleVector.fromArray(dblSpecies, arr, i));
i += speciesLength;
}
sum += acc.reduceLanes(VectorOperators.ADD);
}
t2 = System.currentTimeMillis();
log.info("Vector (original post) took " + (t2 - t1) + " ms with result " + sum);
The result for this was 274ms, so basically the same as the original "Vector (1 accumulator)" code (within a very small margin of error of ~1.4%).
I don't believe that "not assigning" the DoubleVector to a variable in the code makes any difference... Internally, the result of that instruction is assigned to something - either explicitly in your code, or just implicitly to a stack variable inside the JVM so that the result can be passed into the next function as a parameter. Remember, Java passes objects by reference, so there must be a pointer internally to the result of the DoubleVector.fromArray() method invocation in order to pass that into the acc.add() method. It's not like the actual data is being copied in either case - just the reference to that data.
As a final note, there will be very little (if any) heap object allocation in this code. It looks like there is, because of all the DoubleVector.fromArray() calls inside the hot loop... these return a DoubleVector, which is an object, so you'd think it would allocate on the heap for that. But in reality, the JVM is smart enough to allocate data which doesn't "escape" the enclosing method on the stack, which is super-fast (compared to heap allocation). So in case anyone was wondering, the GC overhead for creating all these objects should be nearly negligible, since it should all be on the stack, not the heap.
Upvotes: 0
Reputation: 11
I've done similar testing in Java (JDK 22), also running on a Mac (M3 Max). In my test, I have an array of 8192 doubles, and I want to calculate the sum. The fastest way I found to do it is to combine the Vector API with using separate accumulators. I haven't benchmarked with JMH, because I'm not familiar with it - I've just timed it running 100,000 times over the array. So the timings from my tests are for summing a total of 819,200,000 doubles for each algorithm tested.
public void test() {
// Set up array of doubles...
int num = 8192;
double[] arr = new double[num];
for (int i = 0; i < num; i++) {
arr[i] = Math.random();
}
// Calculate the sum using various methods
final int iterations = 100_000;
long t1, t2;
double sum = 0;
// Linear sum with single accumulator
t1 = System.currentTimeMillis();
sum = 0;
for (int iLoop = 0; iLoop < iterations; iLoop++) {
for (int i = 0; i < num; i++) {
sum += arr[i];
}
}
t2 = System.currentTimeMillis();
log.info("Linear (1 accumulator) took " + (t2 - t1) + " ms with result " + sum);
// Linear sum with 4 accumulators
t1 = System.currentTimeMillis();
sum = 0;
double sum1 = 0;
double sum2 = 0;
double sum3 = 0;
double sum4 = 0;
for (int iLoop = 0; iLoop < iterations; iLoop++) {
for (int i = 0; i < num; i += 4) {
sum1 += arr[i];
sum2 += arr[i + 1];
sum3 += arr[i + 2];
sum4 += arr[i + 3];
}
}
sum = sum1 + sum2 + sum3 + sum4;
t2 = System.currentTimeMillis();
log.info("Linear (4 accumulators) took " + (t2 - t1) + " ms with result " + sum);
VectorSpecies<Double> dblSpecies = DoubleVector.SPECIES_PREFERRED;
int speciesLength = dblSpecies.length();
int loopBound = dblSpecies.loopBound(num);
// Vector ReduceLanes
t1 = System.currentTimeMillis();
sum = 0;
for (int iLoop = 0; iLoop < iterations; iLoop++) {
for (int i = 0; i < loopBound; i += speciesLength) {
DoubleVector dv = DoubleVector.fromArray(dblSpecies, arr, i);
sum += dv.reduceLanes(VectorOperators.ADD);
}
}
t2 = System.currentTimeMillis();
log.info("Vector (reduceLanes) took " + (t2 - t1) + " ms with result " + sum);
// Vector Accumulator
t1 = System.currentTimeMillis();
sum = 0;
for (int iLoop = 0; iLoop < iterations; iLoop++) {
DoubleVector acc = DoubleVector.zero(dblSpecies);
for (int i = 0; i < loopBound; i += speciesLength) {
DoubleVector dv = DoubleVector.fromArray(dblSpecies, arr, i);
acc = acc.add(dv);
}
sum += acc.reduceLanes(VectorOperators.ADD);
}
t2 = System.currentTimeMillis();
log.info("Vector (1 accumulator) took " + (t2 - t1) + " ms with result " + sum);
// Hybrid - Vector accumulator with multiple linear accumulators
t1 = System.currentTimeMillis();
sum = 0;
for (int iLoop = 0; iLoop < iterations; iLoop++) {
// Break into 4 arrays...
int partLen = num / 4;
int s1 = 0;
int s2 = partLen;
int s3 = 2 * partLen;
int s4 = 3 * partLen;
int partLoopBound = dblSpecies.loopBound(partLen);
DoubleVector acc1 = DoubleVector.zero(dblSpecies);
DoubleVector acc2 = DoubleVector.zero(dblSpecies);
DoubleVector acc3 = DoubleVector.zero(dblSpecies);
DoubleVector acc4 = DoubleVector.zero(dblSpecies);
for (int i = 0; i < partLoopBound; i += speciesLength) {
DoubleVector dv1 = DoubleVector.fromArray(dblSpecies, arr, i + s1);
acc1 = acc1.add(dv1);
DoubleVector dv2 = DoubleVector.fromArray(dblSpecies, arr, i + s2);
acc2 = acc2.add(dv2);
DoubleVector dv3 = DoubleVector.fromArray(dblSpecies, arr, i + s3);
acc3 = acc3.add(dv3);
DoubleVector dv4 = DoubleVector.fromArray(dblSpecies, arr, i + s4);
acc4 = acc4.add(dv4);
}
sum += acc1.reduceLanes(VectorOperators.ADD)
+ acc2.reduceLanes(VectorOperators.ADD)
+ acc3.reduceLanes(VectorOperators.ADD)
+ acc4.reduceLanes(VectorOperators.ADD);
}
t2 = System.currentTimeMillis();
log.info("Vector (4 accumulators) took " + (t2 - t1) + " ms with result " + sum);
}
The first time this test method is called, I get results:
Linear (1 accumulator) took 573 ms with result 4.098820819868104E8
Linear (4 accumulators) took 150 ms with result 4.0988208200299764E8
Vector (reduceLanes) took 290 ms with result 4.098820819863893E8
Vector (1 accumulator) took 1691 ms with result 4.0988208200494045E8
Vector (4 accumulators) took 1174 ms with result 4.0988208200494045E8
By the 4th or 5th time I call it, I assume the JVM has optimised the bytecode and I get more sensible timings for the last couple of Vector API calls:
Linear (1 accumulator) took 605 ms with result 4.101329885212625E8
Linear (4 accumulators) took 142 ms with result 4.10132988510815E8
Vector (reduceLanes) took 290 ms with result 4.1013298850172895E8
Vector (1 accumulator) took 279 ms with result 4.1013298851254654E8
Vector (4 accumulators) took 71 ms with result 4.1013298851254654E8
My takeaways from this test are:
I haven't been able to run this test on any other hardware, but my working assumption is that for CPUs with a larger "species size" (e.g. 256 or 512 bit vectors) the gain by using the Vector API would be greater than I found in my results. But even with 128 bit vectors, it's still worth using the Vector API - as long as you combine it with the multiple accumulators technique. This can give you nearly a 10x speedup compared to a naive linear sum.
A couple of other final notes:
In terms of throughput, the "best" algorithm here is summing 819,200,000 doubles in 71ms, which is ~11.5 billion values per second (compared to ~1.4 billion per second using a naive sum). Your mileage will probably vary significantly, depending on your CPU.
Upvotes: 1
Reputation: 431
I believe the algothim I used, was indeed naive.
inline def sum: Double =
var i: Int = 0
var acc = DoubleVector.zero(Matrix.doubleSpecies)
val sp = Matrix.doubleSpecies
val l = sp.length()
while i < sp.loopBound(vec.length) do
acc = acc.add(DoubleVector.fromArray(Matrix.doubleSpecies, vec, i))
i += l
end while
var temp = acc.reduceLanes(VectorOperators.ADD)
// var temp = 0.0
while i < vec.length do
temp += vec(i)
i += 1
end while
temp
end sum
With the key changes being;
This provides a more satisfactory result.
umBenchmark.sum_loop 3 N/A thrpt 3 454987372.518 ± 4974554.176 ops/s
SumBenchmark.sum_loop 100 N/A thrpt 3 22881036.167 ± 429259.728 ops/s
SumBenchmark.sum_loop 100000 N/A thrpt 3 10733.807 ± 65.014 ops/s
SumBenchmark.sum_vec 3 N/A thrpt 3 454811646.671 ± 12166349.908 ops/s
SumBenchmark.sum_vec 100 N/A thrpt 3 40353921.106 ± 547651.178 ops/s
SumBenchmark.sum_vec 100000 N/A thrpt 3 40141.621 ± 714.686 ops/s
SumBenchmark.sum_vec_alt 3 N/A thrpt 3 505591602.643 ± 36682920.662 ops/s
SumBenchmark.sum_vec_alt 100 N/A thrpt 3 110847968.173 ± 1363183.523 ops/s
SumBenchmark.sum_vec_alt 100000 N/A thrpt 3 42821.901 ± 244.293 ops/s
For very large vectors, a factor of 4x (or so).
Upvotes: 2