user8143588
user8143588

Reputation:

Why does iteration over an inclusive range generate longer assembly in Rust than in C++?

These two loops are supposed to be equivalent in C++ and Rust:

#include <cstdint>

std::uint64_t sum1(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t j = 0; j <= n; ++j) {
        sum += 1;
    }
    return sum;
}
pub fn sum1(num: u64) -> u64 {
    let mut sum: u64 = 0;
    for j in 0u64..=num {
        sum += 1;
    }
    return sum;
}

However the C++ version generates a very terse assembly:

sum1(unsigned long):                               # @sum1(unsigned long)
        xor     eax, eax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rax, 1
        cmp     rax, rdi
        jbe     .LBB0_1
        ret

while Rust's version is very long with two checks in the loop instead of one:

example::sum1:
        xor     eax, eax
        xor     ecx, ecx
.LBB0_1:
        mov     rdx, rcx
        cmp     rcx, rdi
        adc     rcx, 0
        add     rax, 1
        cmp     rdx, rdi
        jae     .LBB0_3
        cmp     rcx, rdi
        jbe     .LBB0_1
.LBB0_3:
        ret

Godbolt: https://godbolt.org/z/xYW94qxjK

What is Rust intrinsically trying to prevent that C++ is carefree about?

Upvotes: 53

Views: 4960

Answers (3)

Matthieu M.
Matthieu M.

Reputation: 299910

@user3840170 correctly identified the difference: overflow check in Rust, and not in C++.

Still, the question remains as to why there are 2 checks per loop in the Rust version instead of 1, and the answer to that is that LLVM is not sufficiently smart and/or the RangeInclusive design is not well adapted to LLVM1.

The optimal code generation for short loops, is to split the loop, transforming:

for j in 0u64..=num {
    sum += 1;
}

Into:

for j in 0u64..num { // equivalent to for (auto j = 0; j < num; ++j)
    sum += 1;
}

if 0 <= num {
    sum += 1;
}

This would lead to having a single branch in the main loop, and allow LLVM to switch this to a closed formula2.

The Loop Splitting optimization applies to RangeInclusive and most other Chain iterators, as indeed a RangeInclusive can be thought of as a chain of a once iterator and half-exclusive range iterator (in either order). It is not always a win: like inlining, it implies duplicating the code of the loop body, which if large may lead to a significant overhead in code size.

Unfortunately, LLVM fails to split the loop. I am not sure if it's because the optimization is missing altogether, or it just fails to apply it here for some reason. I'm looking forward to the rustc_codegen_gcc backend, as GCC 7 added Loop Splitting to GCC, and it may be able to generate better code there.


1 See this comment I left over performance issues with RangeInclusive. I spent significant time banging my head over the issue in 2019, and I dearly wish RangeInclusive (and all ranges) were NOT Iterator; it's a costly mistake.

2 Chain iterators, in general, perform much better using .for_each(...), specifically because there the loop is (manually) split. See the playground for (0..=num).for_each(|_| sum += 1) being reduced to num + 1.

Upvotes: 17

dumbass
dumbass

Reputation: 27219

Overflow in the iterator state.

The C++ version will loop forever when given a large enough input:

#include <cstdint>

std::uint64_t sum1(std::uint64_t n) {  
    std::uint64_t sum = 0;
    for (std::uint64_t j = 0; j <= n; ++j) {
        __asm__ ("");
        sum += 1;
    }
    return sum;
}

#include <iostream>

int main() {
    std::cout << sum1(UINT64_C(0xffffffff'ffffffff)) << std::endl;

    return 0;
}

This is because when the loop counter j finally reaches 0xffffffff'ffffffff, incrementing it wraps around to 0, which means the loop invariant j <= n is always fulfilled and the loop never exits.

Strictly speaking, invoking the original version of sum1 with 0xffffffff'ffffffff infamously triggers undefined behaviour, though not because of overflow alone, but since infinite loops without externally-visible side effects are UB ([intro.progress]/1). This is why in my version I added an empty __asm__ statement to the function to act as a dummy ‘side effect’ preventing the compiler from taking ‘advantage’ of that in optimisation passes.

The Rust version, on the other hand, is not only perfectly well-defined, but iterates exactly as many times as the cardinality of the range:

use std::num::Wrapping;

fn sum1(num: u64) -> u64 {
    let mut sum = Wrapping(0);
    for _ in 0..=num {
        sum += Wrapping(1);
    }
    return sum.0;
}

fn main() {
    println!("{}", sum1(0xffffffff_ffffffff));
}

The above program (slightly modified to avoid getting bogged down in debug versus release mode differences with respect to the summation) will terminate after exactly 18 446 744 073 709 551 616 iterations and print 0; the Rust version carefully maintains iterator state so that overflow does not happen in the iterator.

Upvotes: 46

Stargateur
Stargateur

Reputation: 26727

These two loops are equivalent in C++ and Rust

Your two code snippets don't share the same behavior. for (uint64_t j = 0; j <= n; ++j) doesn't handle n == uint64_t::MAX (make it infinite looping) while for j in 0u64..=num do (will never go into an infinite loop).

A rust equivalent code could be:

pub fn sum1(num: u64) -> u64 {
    let mut sum: u64 = 0;
    let mut j = 0;
    while j <= num {
        sum = sum.wrapping_add(1);
        j = j.wrapping_add(1);
    }
    sum
}

currently produce the following asm godbolt:

example::sum1:
        xor     eax, eax
.LBB0_1:
        add     rax, 1
        cmp     rax, rdi
        jbe     .LBB0_1
        ret

Upvotes: 18

Related Questions