avl_sweden

Reputation: 622

Speeding up the loop

I have the following piece of code:

for chunk in imagebuf.chunks_mut(4) {
    let temp = chunk[0];
    chunk[0] = chunk[2];
    chunk[2] = temp;
}

For an array of 40000 u8s, it takes about 2.5 ms on my machine, compiled using cargo build --release.

The following C++ code takes about 100 µs for the exact same data (verified by implementing it and calling it from Rust via FFI):

for(;imagebuf!=endbuf;imagebuf+=4) {
    char c=imagebuf[0];
    imagebuf[0]=imagebuf[2];
    imagebuf[2]=c;
}

I'm thinking it should be possible to speed up the Rust implementation to perform as fast as the C++ version.

The Rust program was built with cargo build --release; the C++ program was built without any optimization flags.

Any hints?

Upvotes: 3

Views: 145

Answers (1)

Lukas Kalbertodt

Reputation: 88516

I cannot reproduce the timings you are getting. You probably have an error in how you measure (or I have 😉). On my machine both versions run in exactly the same time.

In this answer, I will first compare the assembly output of the C++ and the Rust version. Afterwards, I will describe how to reproduce my timings.


Assembly comparison

I generated the assembly code with the amazing Compiler Explorer (Rust code, C++ code). To keep the comparison fair, I compiled the C++ code with optimizations enabled (-O3), too (the C++ compiler's optimization flags had no impact on the measured timings, though). Here is the resulting assembly (Rust on the left, C++ on the right):

example::foo_rust:                    |  foo_cpp(char*, char*):
    test    rsi, rsi                  |      cmp     rdi, rsi
    je      .LBB0_5                   |      je      .L3
    mov     r8d, 4                    |  
.LBB0_2:                              |  .L5:
    cmp     rsi, 4                    |  
    mov     rdx, rsi                  |  
    cmova   rdx, r8                   |  
    test    rdi, rdi                  |  
    je      .LBB0_5                   |  
    cmp     rdx, 3                    |  
    jb      .LBB0_6                   |  
    movzx   ecx, byte ptr [rdi]       |      movzx   edx, BYTE PTR [rdi]
    movzx   eax, byte ptr [rdi + 2]   |      movzx   eax, BYTE PTR [rdi+2]
                                      |      add     rdi, 4
    mov     byte ptr [rdi], al        |      mov     BYTE PTR [rdi-2], al
    mov     byte ptr [rdi + 2], cl    |      mov     BYTE PTR [rdi-4], dl
    lea     rdi, [rdi + rdx]          |  
    sub     rsi, rdx                  |      cmp     rsi, rdi
    jne     .LBB0_2                   |      jne     .L5 
.LBB0_5:                              |  .L3:
                                      |      xor     eax, eax
    ret                               |      ret
.LBB0_6:                              |  
    push    rbp                       +-----------------+  
    mov     rbp, rsp                                    |  
    lea     rdi, [rip + panic_bounds_check_loc.3]       |  
    mov     esi, 2                                      |  
    call    core::panicking::panic_bounds_check@PLT     |

You can immediately see that C++ does in fact produce a lot less assembly (without optimizations, C++ produced nearly as many instructions as Rust does). I am not sure what all of the additional instructions Rust produces are for, but at least half of them are for bounds checking. As far as I understand, this bounds check is not performed for each individual access via [], but only once per loop iteration: it handles the case where the slice's length is not divisible by 4, which would leave a final chunk with fewer than 3 elements. Still, I suspect the Rust assembly could be better (even with bounds checks).
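
To make that remainder case concrete, here is a small sketch of mine (not part of the original comparison): with a length that is not a multiple of 4, chunks_mut(4) yields a final short chunk, and indexing it at 2 hits exactly the panic path visible at the bottom of the Rust assembly.

fn main() {
    let mut buf = [0u8; 6]; // length deliberately not divisible by 4

    // chunks_mut(4) yields one full 4-byte chunk and a 2-byte remainder.
    let lens: Vec<usize> = buf.chunks_mut(4).map(|c| c.len()).collect();
    assert_eq!(lens, vec![4, 2]);

    // buf.chunks_mut(4).last().unwrap()[2] would panic with an
    // "index out of bounds" error -- that is the branch to
    // core::panicking::panic_bounds_check seen in the assembly above.
}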

As mentioned in the comments, you can remove the bounds checks by using get_unchecked() and get_unchecked_mut(). Note, however, that this did not influence the performance in my measurements!
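
For completeness, here is a sketch of what that could look like (my code, not benchmarked above; it is only sound because the assert guarantees every chunk has exactly 4 elements):

pub fn foo_rust_unchecked(imagebuf: &mut [u8]) {
    // Safety precondition: with a length divisible by 4, every chunk
    // yielded below has exactly 4 elements, so indices 0 and 2 are valid.
    assert!(imagebuf.len() % 4 == 0);

    for chunk in imagebuf.chunks_mut(4) {
        unsafe {
            let a = *chunk.get_unchecked(0);
            let b = *chunk.get_unchecked(2);
            *chunk.get_unchecked_mut(0) = b;
            *chunk.get_unchecked_mut(2) = a;
        }
    }
}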

Lastly: you should use the swap(i, j) method of slices here:

for chunk in imagebuf.chunks_mut(4) {
    chunk.swap(0, 2);
}

This, again, did not notably influence performance. But it's shorter and better code.
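
As a side note from me (not in the original answer): on newer compilers, chunks_exact_mut (stable since Rust 1.31) guarantees that every chunk has exactly 4 elements, which removes the remainder case and, with it, the need for the per-iteration bounds check:

pub fn foo_rust_exact(imagebuf: &mut [u8]) {
    // Every chunk is guaranteed to be exactly 4 bytes long, so
    // swap(0, 2) is provably in bounds; a trailing remainder
    // (len % 4 bytes) is simply not visited.
    for chunk in imagebuf.chunks_exact_mut(4) {
        chunk.swap(0, 2);
    }
}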


Measuring

I used this C++ code (in foocpp.cpp):

extern "C" void foo_cpp(char *imagebuf, char *endbuf);

void foo_cpp(char* imagebuf, char* endbuf) {
    for(;imagebuf!=endbuf;imagebuf+=4) {
        char c=imagebuf[0];
        imagebuf[0]=imagebuf[2];
        imagebuf[2]=c;
    }
}

I compiled it with:

gcc -c -O3 foocpp.cpp && ar rvs libfoocpp.a foocpp.o

Then I used this Rust code to measure everything:

#![feature(test)]

extern crate libc;
extern crate test;

use test::black_box;
use std::time::Instant;

#[link(name = "foocpp")]
extern {
    fn foo_cpp(start: *mut libc::c_char, end: *const libc::c_char);
}

// Same loop as in the question: swap bytes 0 and 2 of every 4-byte chunk.
pub fn foo_rust(imagebuf: &mut [u8]) {
    for chunk in imagebuf.chunks_mut(4) {
        let temp = chunk[0];
        chunk[0] = chunk[2];
        chunk[2] = temp;
    }
}

fn main() {
    let mut buf = [0u8; 40_000];

    let before = Instant::now();

    foo_rust(black_box(&mut buf));
    black_box(buf);

    println!("rust: {:?}", Instant::now() - before);

    // ----------------------------------

    let mut buf = [0u8 as libc::c_char; 40_000];

    let before = Instant::now();

    let ptr = buf.as_mut_ptr();
    let end = unsafe { ptr.offset(buf.len() as isize) };
    unsafe { foo_cpp(black_box(ptr), black_box(end)); }
    black_box(buf);

    println!("cpp:  {:?}", Instant::now() - before);
}

The black_box() all over the place prevents the compiler from optimizing where it isn't supposed to.
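
As an aside that is not part of the original setup: on later stable Rust (1.66 and up), std::hint::black_box provides the same facility without the nightly test feature, so the harness could be written roughly like this:

// Sketch of a stable-Rust (1.66+) version of the timing harness:
// std::hint::black_box replaces test::black_box, the rest is unchanged.
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let mut buf = [0u8; 40_000];

    let before = Instant::now();
    foo_rust(black_box(&mut buf)); // foo_rust as defined above
    black_box(&buf);

    println!("rust: {:?}", before.elapsed());
}

The original harness needs a nightly compiler because of #![feature(test)]; I executed it with: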

LIBRARY_PATH=.:$LIBRARY_PATH cargo run --release
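
Another aside of mine: instead of exporting LIBRARY_PATH by hand, a build.rs script could tell Cargo where to find the static library (the path below assumes libfoocpp.a sits in the crate root):

// build.rs -- hypothetical alternative to setting LIBRARY_PATH:
// tells Cargo to also search the crate root for native libraries
// (here, libfoocpp.a) when linking.
fn main() {
    println!("cargo:rustc-link-search=native=.");
}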

Running the program gave me (i7-6700HQ) values like these:

rust: Duration { secs: 0, nanos: 30583 }
cpp:  Duration { secs: 0, nanos: 30810 }

The times fluctuate a lot (way more than the difference between the two versions). I am not exactly sure why the additional assembly generated by Rust does not result in slower execution, though.
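
To tame that fluctuation, one could repeat the measurement and take the minimum. Here is a minimal sketch of mine (the helper bench and the iteration count of 1,000 are my own choices, not part of the measurements above):

use std::time::{Duration, Instant};

// Run f many times and report the fastest observed duration; the
// minimum is much less noisy than a single timing.
fn bench<F: FnMut()>(mut f: F) -> Duration {
    (0..1_000)
        .map(|_| {
            let before = Instant::now();
            f();
            before.elapsed()
        })
        .min()
        .unwrap()
}

Used as bench(|| foo_rust(black_box(&mut buf))), this reports the fastest of 1,000 runs instead of a single noisy sample.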

Upvotes: 7
