Reputation: 627
If I run these benchmarks in Rust:
#[bench]
fn bench_rnd(b: &mut Bencher) {
    let mut rng = rand::weak_rng();
    b.iter(|| rng.gen_range::<f64>(2.0, 100.0));
}

#[bench]
fn bench_ln(b: &mut Bencher) {
    let mut rng = rand::weak_rng();
    b.iter(|| rng.gen_range::<f64>(2.0, 100.0).ln());
}
The result is:
test tests::bench_ln ... bench: 121 ns/iter (+/- 2)
test tests::bench_rnd ... bench: 6 ns/iter (+/- 0)
121 - 6 = 115 ns per ln call.
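The subtraction above (time with ln minus time for the random number alone) can also be sketched without the nightly #[bench] harness. This is only a rough illustration using the standard library: XorShift64, ns_per_iter and measure_ln are hypothetical names, the xorshift generator is a stand-in for rand::weak_rng(), and a simple Instant loop is much noisier than a real benchmark harness.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Tiny xorshift64 generator so the sketch needs no external crates.
/// Assumption: a stand-in for `rand::weak_rng()`, not the same algorithm.
struct XorShift64(u64);

impl XorShift64 {
    /// Returns a pseudo-random f64 between `lo` and `hi`.
    fn gen_range(&mut self, lo: f64, hi: f64) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Use the top 53 bits to build a fraction in [0, 1).
        let unit = (self.0 >> 11) as f64 / (1u64 << 53) as f64;
        lo + unit * (hi - lo)
    }
}

/// Average wall-clock nanoseconds per call of `f` over `iters` calls.
/// `black_box` keeps the optimizer from discarding the result.
fn ns_per_iter<F: FnMut() -> f64>(iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f());
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

/// Returns (baseline ns, baseline-plus-ln ns, estimated ns per ln call).
fn measure_ln(iters: u32) -> (f64, f64, f64) {
    // Same seed for both runs so both see the same inputs.
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15);
    let base = ns_per_iter(iters, || rng.gen_range(2.0, 100.0));
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15);
    let with_ln = ns_per_iter(iters, || rng.gen_range(2.0, 100.0).ln());
    (base, with_ln, with_ln - base)
}
```

Calling measure_ln(10_000_000) from main in an optimized build and printing the third value gives a rough per-call estimate comparable to the 115 ns figure; the exact numbers depend on the machine and compiler flags.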
But the same benchmark in Java:
@State(Scope.Benchmark)
public static class Rnd {
    final double x = ThreadLocalRandom.current().nextDouble(2, 100);
}

@Benchmark
public double testLog(Rnd rnd) {
    return Math.log(rnd.x);
}
Gives me:
Benchmark     Mode  Cnt   Score   Error  Units
Main.testLog  avgt   20  31,555 ± 0,234  ns/op
The log is ~3.7 times slower (115/31) in Rust than in Java.
When I test the hypotenuse implementation (hypot), the implementation in Rust is 15.8 times faster than in Java.
Have I written bad benchmarks, or is it a performance issue?
Responses to questions asked in comments:
"," is a decimal separator in my country.
I run Rust's benchmark using cargo bench, which always runs in release mode.
The Java benchmark framework (JMH) creates a new object for every call, even though it's a static class and a final variable. If I generate the random number inside the tested method, I get 43 ns/op.
Upvotes: 21
Views: 4798
Reputation: 120968
I'm going to provide the other half of the explanation, since I don't know Rust. Math.log is annotated with @HotSpotIntrinsicCandidate, meaning that HotSpot can replace it with hand-optimized native code for that operation. Think of Integer.bitCount, which would either do a lot of shifting or use a direct CPU instruction that does the same thing much faster.
Having an extremely simple program like this:
public static void main(String[] args) {
    System.out.println(mathLn(20_000));
}

private static long mathLn(int x) {
    long result = 0L;
    for (int i = 0; i < x; ++i) {
        result = result + ln(i);
    }
    return result;
}

private static final long ln(int x) {
    return (long) Math.log(x);
}
And running it with:
java -XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining
-XX:+PrintIntrinsics
-XX:CICompilerCount=2
-XX:+PrintCompilation
package/Classname
It will generate a lot of lines, but one of them is:
@ 2 java.lang.Math::log (5 bytes) intrinsic
making this code extremely fast.
I don't really know when and how that happens in Rust though...
Upvotes: 10
Reputation: 627
The answer was given by @kennytm. Setting
export RUSTFLAGS='-Ctarget-cpu=native'
fixes the problem. After that, the results are:
test tests::bench_ln ... bench: 43 ns/iter (+/- 3)
test tests::bench_rnd ... bench: 5 ns/iter (+/- 0)
I think 38 (± 3) is close enough to 31.555 (± 0.234).
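The thread does not say which CPU features made the difference, so the following is only a sketch for checking what a given build was compiled with. It uses the cfg!(target_feature = "...") macro, which reports features enabled at compile time; compile_time_features and describe_features are hypothetical helper names, and the four features listed are just examples.

```rust
/// Reports whether a few x86 SIMD features were enabled at compile time.
/// With default flags on x86_64, typically only the baseline (such as sse2)
/// is on; RUSTFLAGS='-Ctarget-cpu=native' may enable more on newer CPUs.
fn compile_time_features() -> Vec<(&'static str, bool)> {
    vec![
        ("sse2", cfg!(target_feature = "sse2")),
        ("avx", cfg!(target_feature = "avx")),
        ("avx2", cfg!(target_feature = "avx2")),
        ("fma", cfg!(target_feature = "fma")),
    ]
}

/// Formats the feature list as one "name: enabled/disabled" line each.
fn describe_features() -> String {
    compile_time_features()
        .iter()
        .map(|(name, on)| {
            format!("{}: {}", name, if *on { "enabled" } else { "disabled" })
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Printing describe_features() from a binary built with and without the RUSTFLAGS setting shows whether the flag changed anything on a particular machine.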
Upvotes: 14