Per
Per

Reputation: 39

optimize indexing while inside of a loop

i want to try optimizing this line of code:

for i in 0..len {
    slots[values[i].slot].count += 1;
}

(both of these lists are extremely long and the same size)

i have already optimized it like this(2.5x faster), to acess it in cache:

const PREFETCH_DISTANCE: usize = 32;
for i in (0..len).rev() {
    if i >= PREFETCH_DISTANCE {
        let pfs = values[i - PREFETCH_DISTANCE].slot;
        let fetch = &slots[pfs] as *const Slot as *const i8;
        unsafe { _mm_prefetch(fetch, _MM_HINT_T0) }
    }
    slots[values[i].slot].count += 1;
}

however it is still not as fast, as i was expecting. for example, this code, for the same list length is more than twice as fast, just because it doesn't index anything:

for val in &*arr {
    let ptr = val as *const T;
    let val = unsafe { ptr.read() };
    let dif = integrated_expected_distribution(&val) - start_val;
    debug_assert!(dif >= 0.);
    let slot = (dif * end_mult) as usize;
    debug_assert!(slot < len);
    values.push(Slottable { value: val, slot });
}

Am i missing something else, that i need to preload for that indexing? Or am i missing something, that causes performance issues?

here is a reproducable example:

    let amount = 100_000_000;
    let highest_possible_num = 1_000_000_000;
    let seed = 44;

    let mut rng = rand::rngs::StdRng::seed_from_u64(seed);

    let mut vec2: Vec<usize> = (0..amount)
        .map(|_| rng.gen_range(0..=highest_possible_num))
        .collect();
    let vec1: Vec<usize> = (0..amount).map(|_| rng.gen_range(0..amount)).collect();
    let len = amount;
    const PREFETCH_DISTANCE: usize = 32;
    let loop_ = Instant::now();
    for i in (0..len).rev() {
        if i >= PREFETCH_DISTANCE {
            let pfs = vec1[i - PREFETCH_DISTANCE];
            let fetch = &vec2[pfs] as *const usize as *const i8;
            unsafe { _mm_prefetch(fetch, _MM_HINT_T0) }
        }
        vec2[vec1[i]] += 1;
    }
    let loop_duration = loop_.elapsed();
    println!("Loop time: {:?}", loop_duration);

the removed count and slot are each just part of a datastructure of 8 byte, so the change shouldn't have any effect on the example

Edit: i think instruction-level parallelism can't be done on the loop, so it becomes slower.

Edit2: i've tested some more and Edit1 was wrong. This code is waaay faster than my prefetched code:

    let mut test = 0;
    for i in (0..len).rev() {
        test += values[i].slot;
    }
    println!("test: {}", test);

Upvotes: 1

Views: 136

Answers (0)

Related Questions