sarema

Reputation: 725

Why is Vec::with_capacity slower than Vec::new for small final lengths?

Consider this code.

type Int = i32;
const MAX_NUMBER: Int = 1_000_000;

fn main() {
    let result1 = with_new();
    let result2 = with_capacity();
    assert_eq!(result1, result2)
}

fn with_new() -> Vec<Int> {
    let mut result = Vec::new();
    for i in 0..MAX_NUMBER {
        result.push(i);
    }
    result
}

fn with_capacity() -> Vec<Int> {
    let mut result = Vec::with_capacity(MAX_NUMBER as usize);
    for i in 0..MAX_NUMBER {
        result.push(i);
    }
    result
}

Both functions produce the same output. One uses Vec::new, the other uses Vec::with_capacity. For small values of MAX_NUMBER (like in the example), with_capacity is slower than new. Only for larger final vector lengths (e.g. 100 million) is the version using with_capacity as fast as the one using new.

[Flamegraph for 1 million elements]

[Flamegraph for 100 million elements]

It is my understanding that with_capacity should always be faster if the final length is known, because the heap storage is allocated once up front as a single chunk. In contrast, the version with new has to grow the vector repeatedly as elements are pushed, which results in multiple allocations and copies.
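The number of reallocations can be observed directly by watching capacity() change between pushes. This is a small sketch (the exact growth sequence is an implementation detail of the standard library, but growth is geometric, so the count is logarithmic in the element count rather than one per push):

```rust
// Count how many times a growing Vec actually reallocates by watching
// `capacity()` change after each push.
fn count_reallocations(n: i32) -> usize {
    let mut v: Vec<i32> = Vec::new();
    let mut reallocations = 0;
    let mut last_capacity = v.capacity();
    for i in 0..n {
        v.push(i);
        if v.capacity() != last_capacity {
            reallocations += 1;
            last_capacity = v.capacity();
        }
    }
    reallocations
}

fn main() {
    // Far fewer than 1_000_000 reallocations, thanks to geometric growth.
    println!("{} reallocations", count_reallocations(1_000_000));
}
```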

What am I missing?

Edit

The first section was compiled with the debug profile. If I use the release profile with the following settings in Cargo.toml

[package]
name = "vec_test"
version = "0.1.0"
edition = "2021"

[profile.release]
opt-level = 3
debug = 2

I still get the following result for a length of 10 million.

[Flamegraph for 10 million elements, release profile]

Upvotes: 2

Views: 1980

Answers (1)

kmdreko

Reputation: 60492

I was not able to reproduce this in a synthetic benchmark:

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn with_new(size: i32) -> Vec<i32> {
    let mut result = Vec::new();
    for i in 0..size {
        result.push(i);
    }
    result
}

fn with_capacity(size: i32) -> Vec<i32> {
    let mut result = Vec::with_capacity(size as usize);
    for i in 0..size {
        result.push(i);
    }
    result
}

pub fn with_new_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("with_new");
    for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000].iter() {
        group.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &size| {
            b.iter(|| with_new(size));
        });
    }
    group.finish();
}

pub fn with_capacity_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("with_capacity");
    for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000].iter() {
        group.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &size| {
            b.iter(|| with_capacity(size));
        });
    }
    group.finish();
}

criterion_group!(benches, with_new_benchmark, with_capacity_benchmark);
criterion_main!(benches);

Here's the output (with outliers and other benchmarking stuff removed):

with_new/100            time:   [331.17 ns 331.38 ns 331.61 ns]
with_new/1000           time:   [1.1719 us 1.1731 us 1.1745 us]
with_new/10000          time:   [8.6784 us 8.6840 us 8.6899 us]
with_new/100000         time:   [77.524 us 77.596 us 77.680 us]
with_new/1000000        time:   [1.6966 ms 1.6978 ms 1.6990 ms]
with_new/10000000       time:   [22.063 ms 22.084 ms 22.105 ms]

with_capacity/100       time:   [76.796 ns 76.859 ns 76.926 ns]
with_capacity/1000      time:   [497.90 ns 498.14 ns 498.39 ns]
with_capacity/10000     time:   [5.0058 us 5.0084 us 5.0112 us]
with_capacity/100000    time:   [50.358 us 50.414 us 50.470 us]
with_capacity/1000000   time:   [1.0861 ms 1.0868 ms 1.0876 ms]
with_capacity/10000000  time:   [10.644 ms 10.656 ms 10.668 ms]

The with_capacity runs were consistently faster than with_new. The closest results were in the 10,000- to 1,000,000-element runs, where with_capacity still took only ~60% of the time of with_new; in the other runs it took half the time or less.

It crossed my mind that there could be some strange const-propagation behavior going on, but even with individual functions with hard-coded sizes (playground for brevity), the behavior didn't significantly change:

with_new/100            time:   [313.87 ns 314.22 ns 314.56 ns]
with_new/1000           time:   [1.1498 us 1.1505 us 1.1511 us]
with_new/10000          time:   [7.9062 us 7.9095 us 7.9130 us]
with_new/100000         time:   [77.925 us 77.990 us 78.057 us]
with_new/1000000        time:   [1.5675 ms 1.5683 ms 1.5691 ms]
with_new/10000000       time:   [20.956 ms 20.990 ms 21.023 ms]

with_capacity/100       time:   [76.943 ns 76.999 ns 77.064 ns]
with_capacity/1000      time:   [535.00 ns 536.22 ns 537.21 ns]
with_capacity/10000     time:   [5.1122 us 5.1150 us 5.1181 us]
with_capacity/100000    time:   [50.064 us 50.092 us 50.122 us]
with_capacity/1000000   time:   [1.0768 ms 1.0776 ms 1.0784 ms]
with_capacity/10000000  time:   [10.600 ms 10.613 ms 10.628 ms]

Your testing code only calls each strategy once, so it's conceivable that your testing environment changes after the first call. A potential culprit is heap fragmentation, as suggested by @trent_formerly_cl in the comments, though there could be others: CPU boosting/throttling, spatial and/or temporal cache locality, OS behavior, etc. A benchmarking framework like criterion helps avoid a lot of these problems by iterating each test multiple times (including warmup iterations).
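A minimal version of that repeated-measurement idea needs only the standard library. This is a hypothetical sketch (time_many is my own helper, and criterion does far more, e.g. statistical analysis and outlier detection), but it shows the two key ingredients: warmup iterations and std::hint::black_box to keep the optimizer from discarding the result:

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical helper: run `f` a few times as warmup, then report the
// mean wall-clock time over `iters` timed runs.
fn time_many<T, F: FnMut() -> T>(mut f: F, iters: u32) -> Duration {
    for _ in 0..10 {
        black_box(f()); // warmup: stabilize allocator, caches, CPU frequency
    }
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f()); // black_box keeps the result from being optimized away
    }
    start.elapsed() / iters
}

fn main() {
    let n: i32 = 1_000_000;
    let t_new = time_many(
        || {
            let mut v = Vec::new();
            for i in 0..n {
                v.push(i);
            }
            v
        },
        50,
    );
    let t_cap = time_many(
        || {
            let mut v = Vec::with_capacity(n as usize);
            for i in 0..n {
                v.push(i);
            }
            v
        },
        50,
    );
    println!("Vec::new: {t_new:?}, Vec::with_capacity: {t_cap:?}");
}
```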

Upvotes: 6
