DBE

Reputation: 353

Rust target-cpu=native gets slower SIMD execution

I'm testing the Rust wrappers for the x86 intrinsics with a simple program that approximates PI/4 using the Leibniz series:

#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

fn main() {
    let mut n: u64 = 0;
    let pi4 = std::f64::consts::PI / 4.0;
    unsafe {
        // Denominators of four consecutive terms, with the alternating sign folded in.
        let mut dens = _mm256_set_pd(1.0f64, -3.0f64, 5.0f64, -7.0f64);
        // Each lane skips ahead four terms per iteration, so its denominator moves by 8.
        let adder = _mm256_set_pd(8.0f64, -8.0f64, 8.0f64, -8.0f64);
        let ones = _mm256_set1_pd(1.0f64);
        let mut rsum = _mm256_set1_pd(0.0f64);
        let mut quotients: __m256d;
        loop {
            quotients = _mm256_div_pd(ones, dens);
            rsum = _mm256_add_pd(rsum, quotients);
            dens = _mm256_add_pd(dens, adder);
            n = n + 1;
            // Horizontal sum of the four lanes gives the current partial sum.
            let vlow = _mm256_extractf128_pd(rsum, 0);
            let vhigh = _mm256_extractf128_pd(rsum, 1);
            let add_partial = _mm_add_pd(vlow, vhigh);
            let sum = _mm_cvtsd_f64(add_partial)
                + _mm_cvtsd_f64(_mm_unpackhi_pd(add_partial, add_partial));
            // Stop once the partial sum is within 1e-9 of PI/4.
            if f64::abs(pi4 - sum) < 1.0e-9 {
                break;
            }
        }
    }
    // Each iteration consumes four terms of the series.
    println!("Steps: {}", 4 * n);
}

Functionally, the program works as expected. My CPU model is "AMD A8-9600 RADEON R7", and:

$ rustc --target=x86_64-linux-kernel --print target-cpus
Available CPUs for this target:
    native         - Select the CPU of the current host (currently bdver4).

When compiling with:

$ cargo build --release

The time is:

$ time target/release/sotest 
real    0m1.668s
user    0m1.667s
sys     0m0.001s

But with the "native" target it runs slower:

$ RUSTFLAGS="-C target-cpu=native" cargo build --release
...
$ time target/release/sotest
real    0m2.783s
user    0m2.778s
sys     0m0.004s

The question is: what's wrong with the "native" target-cpu? From reading the documentation, I expected a binary that leverages all the extensions my CPU provides:

The compiler will translate this into a list of target features.

Even if it does not take the extensions into account, why did it get slower?

BTW, compiling with the avx feature explicitly enabled gives a big speedup:

$ RUSTFLAGS="-C target-feature=+avx" cargo build --release
...
real    0m0.358s
user    0m0.354s
sys     0m0.004s

EDIT: I'm using Ubuntu 20.04 (kernel 5.4.0-72-generic) and rustc 1.51.0.

Upvotes: 4

Views: 4693

Answers (1)

BurntSushi5

Reputation: 15334

My guess is that you're hitting this bug: https://github.com/rust-lang/rust/issues/83027, which was resolved on March 17, 2021 by https://github.com/rust-lang/rust/pull/83084.

The bug is that when target-cpu=native is used, target_feature isn't applied correctly, and target_feature is what all of the intrinsics rely on. As a result, your calls to the intrinsic functions probably aren't being inlined. You should look at a profile to confirm that.
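As a quick sanity check alongside a profile, you can compare what the compiler enabled at build time with what the CPU reports at run time. This is only a sketch: avx_status is a hypothetical helper name, and whether the linked bug actually shows up through cfg! is an assumption on my part, so the profile remains the real test.

```rust
/// Returns (compile_time, runtime) availability of AVX.
/// `avx_status` is a hypothetical helper, not part of any library.
fn avx_status() -> (bool, bool) {
    // True only if the compiler was told AVX is available for the whole crate,
    // e.g. via -C target-feature=+avx or a target-cpu that implies it.
    let compile_time = cfg!(target_feature = "avx");

    // Ask the CPU at run time (the macro only exists on x86 targets).
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    let runtime = is_x86_feature_detected!("avx");
    #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
    let runtime = false;

    (compile_time, runtime)
}

fn main() {
    let (compile_time, runtime) = avx_status();
    println!("avx enabled at compile time: {}", compile_time);
    println!("avx detected at run time:    {}", runtime);
}
```

If the first line prints false in a build where you expected AVX to be on, the intrinsics are being called from code that doesn't have the feature enabled, which is the situation where they may not get inlined.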

More generally, I would recommend using runtime CPU feature detection and correct use of #[target_feature]. You should only be calling functions that operate on 32-byte vectors from functions that have at least the avx feature enabled.
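Here is a minimal sketch of that pattern, reusing the Leibniz loop from the question. The function names (leibniz_avx, leibniz_scalar, leibniz) and the fixed iteration count are mine, chosen for illustration; the AVX kernel carries #[target_feature(enable = "avx")] and is only reached after is_x86_feature_detected!("avx") succeeds.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// AVX kernel: sums `iters` groups of four Leibniz terms.
/// Safety: the caller must make sure the CPU supports AVX.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn leibniz_avx(iters: u64) -> f64 {
    let mut dens = _mm256_set_pd(1.0, -3.0, 5.0, -7.0);
    let adder = _mm256_set_pd(8.0, -8.0, 8.0, -8.0);
    let ones = _mm256_set1_pd(1.0);
    let mut rsum = _mm256_setzero_pd();
    for _ in 0..iters {
        rsum = _mm256_add_pd(rsum, _mm256_div_pd(ones, dens));
        dens = _mm256_add_pd(dens, adder);
    }
    // Horizontal sum, done once after the loop.
    let mut lanes = [0.0f64; 4];
    _mm256_storeu_pd(lanes.as_mut_ptr(), rsum);
    lanes.iter().sum()
}

/// Portable fallback computing the same 4 * iters terms.
fn leibniz_scalar(iters: u64) -> f64 {
    let mut sum = 0.0;
    for k in 0..4 * iters {
        let den = (2 * k + 1) as f64;
        sum += if k % 2 == 0 { 1.0 / den } else { -1.0 / den };
    }
    sum
}

/// Safe entry point: picks the AVX kernel only when the CPU reports AVX.
fn leibniz(iters: u64) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Sound because the feature was just checked.
            return unsafe { leibniz_avx(iters) };
        }
    }
    leibniz_scalar(iters)
}

fn main() {
    let approx = 4.0 * leibniz(1_000_000);
    println!("pi ~= {}", approx);
}
```

With this layout the binary doesn't depend on target-cpu or target-feature flags to get the AVX code path, and it still runs on CPUs without AVX.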

Upvotes: 6
