Reputation: 39
i want to try optimizing this line of code:
for i in 0..len {
slots[values[i].slot].count += 1;
}
(both of these lists are extremely long and the same size)
i have already optimized it like this(2.5x faster), to acess it in cache:
const PREFETCH_DISTANCE: usize = 32;
for i in (0..len).rev() {
if i >= PREFETCH_DISTANCE {
let pfs = values[i - PREFETCH_DISTANCE].slot;
let fetch = &slots[pfs] as *const Slot as *const i8;
unsafe { _mm_prefetch(fetch, _MM_HINT_T0) }
}
slots[values[i].slot].count += 1;
}
however it is still not as fast, as i was expecting. for example, this code, for the same list length is more than twice as fast, just because it doesn't index anything:
for val in &*arr {
let ptr = val as *const T;
let val = unsafe { ptr.read() };
let dif = integrated_expected_distribution(&val) - start_val;
debug_assert!(dif >= 0.);
let slot = (dif * end_mult) as usize;
debug_assert!(slot < len);
values.push(Slottable { value: val, slot });
}
Am i missing something else, that i need to preload for that indexing? Or am i missing something, that causes performance issues?
here is a reproducable example:
let amount = 100_000_000;
let highest_possible_num = 1_000_000_000;
let seed = 44;
let mut rng = rand::rngs::StdRng::seed_from_u64(seed);
let mut vec2: Vec<usize> = (0..amount)
.map(|_| rng.gen_range(0..=highest_possible_num))
.collect();
let vec1: Vec<usize> = (0..amount).map(|_| rng.gen_range(0..amount)).collect();
let len = amount;
const PREFETCH_DISTANCE: usize = 32;
let loop_ = Instant::now();
for i in (0..len).rev() {
if i >= PREFETCH_DISTANCE {
let pfs = vec1[i - PREFETCH_DISTANCE];
let fetch = &vec2[pfs] as *const usize as *const i8;
unsafe { _mm_prefetch(fetch, _MM_HINT_T0) }
}
vec2[vec1[i]] += 1;
}
let loop_duration = loop_.elapsed();
println!("Loop time: {:?}", loop_duration);
the removed count and slot are each just part of a datastructure of 8 byte, so the change shouldn't have any effect on the example
Edit: i think instruction-level parallelism can't be done on the loop, so it becomes slower.
Edit2: i've tested some more and Edit1 was wrong. This code is waaay faster than my prefetched code:
let mut test = 0;
for i in (0..len).rev() {
test += values[i].slot;
}
println!("test: {}", test);
Upvotes: 1
Views: 136