How to reason about large performance differences between these two very similar functions?

Question

The following two functions calculate the same value. They differ only in a subtraction, a cast and a replacing skip for take in an iterator.

They have a 5.5x difference in performance.

fn lehmer_index(xs: [u8; 8]) -> u32 {
    const FACTORIALS: [u32; 8] = [1, 1, 2, 6, 24, 120, 720, 5040];
    (0..8).fold(0, |acc, i| {
        acc + FACTORIALS[7 - i] * xs.iter().skip(i + 1).filter(|&&x| x < xs[i]).count() as u32
    })
}

fn lehmer_index2(xs: [u8; 8]) -> u32 {
    const FACTORIALS: [u32; 8] = [1, 1, 2, 6, 24, 120, 720, 5040];
    (0..8).fold(0, |acc, i| {
        acc + FACTORIALS[7 - i] * ((xs[i] as u32) - xs.iter().take(i).filter(|&&x| x < xs[i]).count() as u32)
    })
}

The benchmark:

lehmer_index            time:   [8.4142 ns 8.4796 ns 8.5542 ns]
lehmer_index2           time:   [46.726 ns 46.812 ns 46.921 ns]

In my simulations, I'm calculating this trillions of times and it's a huge difference. I'd use the first one but the second one is a generalized version that uses input with fewer assumptions which I need.

Why the speed difference? How do I reason about performance differences like this?

user1937198 · Accepted Answer

Looking at the assembly it looks like the first one is completely inlined and unrolled vectorized code, whilst the second tripped the optimizer so it couldn't unroll the loops.

There isn't any real way to reason about this from the rust code, you need to pull out the resulting assembly and an intuition of modern x86 processors. Then just try modifying the code until the optimizer produces something that meets the benchmark.

In the specific case (for rust 1.47), simply reordering the expression by assigning the inner values to variables seems to result in linear assembly that may prove significantly better in your benchmark:

pub fn lehmer_index3(xs: [u8; 8]) -> u32 {
    const FACTORIALS: [u32; 8] = [1, 1, 2, 6, 24, 120, 720, 5040];
    (0..8).fold(0, |acc, i| {
        let a = xs[i] as u32;
        let b = xs.iter()
            .take(i)
            .filter(|&&x| x < xs[i])
            .count() as u32;
        let c = a - b;
        acc + FACTORIALS[7 - i] * c
    })
}

How to reason about large performance differences between these two very similar functions?

Answers (1)

Related Questions