Reputation: 165
I ran the same small benchmark in both Java and Rust.
Java:
public class Main {
    private static final int NUM_ITERS = 100;

    public static void main(String[] args) {
        long tInit = System.nanoTime();
        int c = 0;
        for (int i = 0; i < NUM_ITERS; ++i) {
            for (int j = 0; j < NUM_ITERS; ++j) {
                for (int k = 0; k < NUM_ITERS; ++k) {
                    if (i*i + j*j == k*k) {
                        ++c;
                        System.out.println(i + " " + j + " " + k);
                    }
                }
            }
        }
        System.out.println(c);
        System.out.println(System.nanoTime() - tInit);
    }
}
Rust:
use std::time::SystemTime;

const NUM_ITERS: i32 = 100;

fn main() {
    let t_init = SystemTime::now();
    let mut c = 0;
    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i*i + j*j == k*k {
                    c += 1;
                    println!("{} {} {}", i, j, k);
                }
            }
        }
    }
    println!("{}", c);
    println!("{}", t_init.elapsed().unwrap().as_nanos());
}
When NUM_ITERS = 100, as expected, Rust outperformed Java:
Java: 59311348 ns
Rust: 29629242 ns
But for NUM_ITERS = 1000, I saw that Rust took much longer and Java was way faster:
Java: 1585835361 ns
Rust: 28623818145 ns
What could be the reason for this? Shouldn't Rust perform better than Java in this case too? Or have I made some mistake in the implementation?
I removed the lines System.out.println(i + " " + j + " " + k); and println!("{} {} {}", i, j, k); from the code. Here are the outputs:
NUM_ITERS = 100
Java: 3843114 ns
Rust: 29072345 ns
NUM_ITERS = 1000
Java: 1014829974 ns
Rust: 28402166953 ns
So, without the println statements, Java performs better than Rust in both cases. I simply want to know why that is the case, given that Java has the garbage collector running and other overheads. Have I not implemented the loops in Rust optimally?
Upvotes: 13
Views: 14535
Reputation: 249
The Java VM probably just-in-time compiles your inner loop to native machine code after it exceeds a certain number of iterations. I am not a JVM expert, but this could explain why a larger number of iterations magically makes the Java code run faster.
The technology behind this is called HotSpot, and as far as I know it works differently on "client" and "server" VMs. Whether it is available and how it behaves also depends on the vendor of the JVM.
https://en.wikipedia.org/wiki/HotSpot_(virtual_machine)
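One way to observe this (assuming a HotSpot-based JDK, and that the class is named Main as in the question) is to enable the JIT's compilation log, which prints a line each time a method gets compiled:
java -XX:+PrintCompilation Main
If the method containing the loops shows up in that log shortly after startup, JIT compilation is at least part of the explanation.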
Upvotes: 2
Reputation: 200296
I adjusted your code to eliminate the points of criticism laid out in the comments. Not compiling Rust in release mode is the biggest problem; that alone introduces about a 50x overhead. Beyond that, I eliminated printing while measuring, and did a proper warm-up of the Java code.
I would say that Java and Rust are on par after these changes: they are within 2x of each other, and both have a very low cost per iteration (just a fraction of a nanosecond).
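For reference, the release-mode point is just a matter of which Cargo profile you build with:
# debug profile: no optimizations, only useful for development
cargo run
# release profile: optimized build, the only one worth benchmarking
cargo run --release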
Here is my code:
public class Testing {
    private static final int NUM_ITERS = 1_000;
    private static final int MEASURE_TIMES = 7;

    public static void main(String[] args) {
        for (int i = 0; i < MEASURE_TIMES; i++) {
            System.out.format("%.2f ns per iteration%n", benchmark());
        }
    }

    private static double benchmark() {
        long tInit = System.nanoTime();
        int c = 0;
        for (int i = 0; i < NUM_ITERS; ++i) {
            for (int j = 0; j < NUM_ITERS; ++j) {
                for (int k = 0; k < NUM_ITERS; ++k) {
                    if (i*i + j*j == k*k) {
                        ++c;
                    }
                }
            }
        }
        if (c % 137 == 0) {
            // Use c so its computation can't be elided
            System.out.println("Count is divisible by 137: " + c);
        }
        long tookNanos = System.nanoTime() - tInit;
        return tookNanos / ((double) NUM_ITERS * NUM_ITERS * NUM_ITERS);
    }
}
use std::time::SystemTime;

const NUM_ITERS: i32 = 1000;

fn main() {
    let mut c = 0;
    let t_init = SystemTime::now();
    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i*i + j*j == k*k {
                    c += 1;
                }
            }
        }
    }
    let took_ns = t_init.elapsed().unwrap().as_nanos() as f64;
    let iters = NUM_ITERS as f64;
    println!("{} ns per iteration", took_ns / (iters * iters * iters));
    // Use c to ensure its computation can't be elided by the optimizer
    if c % 137 == 0 {
        println!("Count is divisible by 137: {}", c);
    }
}
I run Java from IntelliJ, with JDK 16. I run Rust from the command line, using cargo run --release.
Example of Java output:
0.98 ns per iteration
0.93 ns per iteration
0.32 ns per iteration
0.34 ns per iteration
0.32 ns per iteration
0.33 ns per iteration
0.32 ns per iteration
Example of Rust output:
0.600314 ns per iteration
While I'm not necessarily surprised to see Java giving a better result (its JIT compiler has been optimized for 20 years now, and there's no object allocation, so no GC), I was puzzled by the overall low cost of an iteration. We can assume the expression i*i + j*j to be hoisted out of the inner loop, which leaves just k*k inside it.
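To illustrate, here is a hand-written sketch of that transformation (this is what we assume the optimizer does, not the actual generated code):
const NUM_ITERS: i32 = 1000;

fn main() {
    let mut c = 0;
    for i in 0..NUM_ITERS {
        // i*i is invariant in the j and k loops, so compute it once per i
        let ii = i * i;
        for j in 0..NUM_ITERS {
            // i*i + j*j is invariant in the k loop, so compute it once per j
            let sum = ii + j * j;
            for k in 0..NUM_ITERS {
                // only k*k remains in the innermost loop
                if sum == k * k {
                    c += 1;
                }
            }
        }
    }
    println!("{}", c);
}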
I used a disassembler to check the code Rust produced, and it definitely involves IMUL in the innermost loop. I read this answer, which says Intel CPUs have a latency of just 3 cycles for an IMUL instruction. Combine that with multiple ALUs and instruction-level parallelism, and the result of 1 cycle per iteration becomes more plausible.
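If you want to check the disassembly yourself, one option (assuming a Linux toolchain; substitute your crate's binary name for <binary>) is:
objdump -d target/release/<binary> | less
Then search the output for the IMUL instructions in the hot loop.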
Another interesting thing I discovered is that if I just check c % 137 == 0 but don't print the actual value of c in the Rust println! statement (only print "Count is divisible by 137"), the iteration cost drops to just 0.26 ns. So Rust was able to eliminate a lot of work from the loop when I didn't ask for the exact value of c.
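Concretely, that faster variant changes only the final check, so that c is never fully observed:
if c % 137 == 0 {
    // c is only observed modulo 137 here, which apparently lets the
    // optimizer drop or simplify much of the counting work
    println!("Count is divisible by 137");
}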
As discussed in the comments with @trentci, I mimicked the Java code more completely, adding an outer loop that repeats the measurement, which is now in a separate function:
use std::time::SystemTime;

const NUM_ITERS: i32 = 1000;
const MEASURE_TIMES: i32 = 7;

fn main() {
    let total_iters: f64 = NUM_ITERS as f64 * NUM_ITERS as f64 * NUM_ITERS as f64;
    for _ in 0..MEASURE_TIMES {
        let took_ns = benchmark() as f64;
        println!("{} ns per iteration", took_ns / total_iters);
    }
}

fn benchmark() -> u128 {
    let mut c = 0;
    let t_init = SystemTime::now();
    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i*i + j*j == k*k {
                    c += 1;
                }
            }
        }
    }
    // Use c to ensure its computation can't be elided by the optimizer
    if c % 137 == 0 {
        println!("Count is divisible by 137: {}", c);
    }
    t_init.elapsed().unwrap().as_nanos()
}
Now I'm getting this output:
0.781475 ns per iteration
0.760657 ns per iteration
0.783821 ns per iteration
0.777313 ns per iteration
0.766473 ns per iteration
0.774042 ns per iteration
0.766718 ns per iteration
This is another case where a subtle change to the code resulted in a significant change in performance. However, it also shows a key advantage of Rust over Java: there is no warmup needed to reach optimum performance.
Upvotes: 17