Reputation: 6866
When C++20 got stackless coroutines, some papers successfully used them to "hide" cache misses via prefetching and switching to another coroutine. As far as I can tell, Rust's async is also a stackless-coroutine mechanism in that it is a "zero cost abstraction". Is there work similar to the papers I mentioned on implementing such techniques in Rust? If not, is there anything fundamentally preventing one from doing something like that with async/await?
Edit: I wanted to give a very high-level, simplified summary of what I understand the papers to propose:
We want to run a bunch of independent processes that look like the following:
P1 = load1;proc1; load1';proc1'; load1'';proc1''; ...
P2 = load2;proc2; load2';proc2'; load2'';proc2''; ...
...
PN = loadN;procN; loadN';procN'; loadN'';procN''; ...
where all the loadI terms are likely to cause a cache miss. The authors leverage coroutines to (dynamically) interleave the processes so that the code executed looks like the following:
P =
prefetch1;prefetch2;...;prefetchN;
load1;proc1;prefetch1'; # yield to the scheduler
load2;proc2;prefetch2'; # yield to the scheduler
...
loadN;procN;prefetchN'; # yield to the scheduler
load1';proc1';prefetch1''; # yield to the scheduler
load2';proc2';prefetch2''; # yield to the scheduler
...
loadN';procN';prefetchN''; # yield to the scheduler
...
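To make this concrete, here is a rough sketch (my own illustration, not taken from the papers, and x86_64-only because of the _mm_prefetch intrinsic) of the same interleaving written by hand, without coroutines: each "process" does one load+process step, prefetches its next element, and returns to a round-robin loop that only resumes it after every other process has had a turn:
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
// One "process": walks its own (hypothetical) index sequence over a shared table.
struct Walker<'a> {
    table: &'a [u64],
    indices: &'a [usize],
    pos: usize,
    sum: u64,
}
impl<'a> Walker<'a> {
    // One loadI;procI step plus a prefetch of the next element; returns false when done.
    fn step(&mut self) -> bool {
        let Some(&i) = self.indices.get(self.pos) else { return false };
        self.sum += self.table[i]; // the likely cache miss
        if let Some(&next) = self.indices.get(self.pos + 1) {
            // Request the next element ahead of time (T0 = into all cache levels).
            unsafe { _mm_prefetch::<_MM_HINT_T0>(&self.table[next] as *const u64 as *const i8) };
        }
        self.pos += 1;
        true
    }
}
// The "scheduler": round-robin over all processes until every one is finished,
// which produces exactly the interleaving sketched above.
fn run(walkers: &mut [Walker<'_>]) -> u64 {
    let mut live = true;
    while live {
        live = false;
        for w in walkers.iter_mut() {
            live |= w.step();
        }
    }
    walkers.iter().map(|w| w.sum).sum()
}
The papers' contribution, as I understand it, is getting coroutines to generate this kind of state machine for you instead of writing it by hand, which is what I am asking about for Rust's async/await.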
Upvotes: 13
Views: 997
Reputation: 2664
The full code for this post can be found here.
I didn't look super hard, but as far as I'm aware there is no existing research on this topic for Rust, so I decided to do a little bit of it myself. Unlike the papers mentioned in the question, I took a simple approach to test whether the technique helps: create a couple of simple linked lists and sum them:
pub enum List<T> {
Cons(T, Box<List<T>>),
Nil,
}
impl<T> List<T> {
pub fn new(iter: impl IntoIterator<Item = T>) -> Self {
// Builds the list by prepending, so the items end up in reverse order; harmless for a sum.
let mut tail = List::Nil;
for item in iter {
tail = List::Cons(item, Box::new(tail));
}
tail
}
}
const END: i32 = 1024 * 1024 * 1024 / 16;
fn gen_lists() -> (List<i32>, List<i32>) {
let range = 1..=END;
(List::new(range.clone()), List::new(range))
}
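For scale: each Cons node is an i32 plus a Box pointer, 16 bytes on a typical 64-bit target, so END works out to 2^26 nodes and each list occupies roughly 1 GiB of heap, far more than any cache level, which is what makes each pointer hop a likely miss. A quick sanity check of that node-size assumption:
#[test]
fn node_is_16_bytes() {
    // 4-byte i32 + 8-byte Box pointer, packed into 16 bytes per heap node,
    // so two lists of 2^26 nodes come to roughly 2 GiB.
    assert_eq!(std::mem::size_of::<List<i32>>(), 16);
}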
OK, a couple of big, simple linked lists. I ran nine different algorithms in two different benchmarks to see how prefetching affected things. The benchmarks are summing the lists in an owned fashion, where the list is destroyed during iteration and deallocation accounts for the bulk of the measured time, and summing them in a borrowed fashion, where deallocation is not measured. The nine variants are really just three algorithms implemented with three different techniques: Iterators, Generators, and async Streams.
The three algorithms are zip, which iterates over both lists in lockstep; chain, which iterates over the lists one after the other; and zip prefetch, which uses the prefetch-and-switch method while zipping the two lists together. The basic iterator looks like this:
pub struct ListIter<T>(List<T>);
impl<T> Iterator for ListIter<T> {
type Item = T;
fn next(&mut self) -> Option<T> {
let mut temp = List::Nil;
std::mem::swap(&mut temp, &mut self.0);
match temp {
List::Cons(t, next) => {
self.0 = *next;
Some(t)
}
List::Nil => None,
}
}
}
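The benchmark code further down calls .into_iter(), .zip(), and .chain() directly on a List, which assumes an IntoIterator impl along these lines (not shown in the post, but presumably just wrapping the list):
impl<T> IntoIterator for List<T> {
    type Item = T;
    type IntoIter = ListIter<T>;
    fn into_iter(self) -> ListIter<T> {
        // Hand the whole list to the iterator, which then consumes it node by node.
        ListIter(self)
    }
}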
The version with prefetching looks like this:
pub struct ListIterPrefetch<T>(List<T>);
impl<T> Iterator for ListIterPrefetch<T> {
type Item = T;
fn next(&mut self) -> Option<T> {
let mut temp = List::Nil;
std::mem::swap(&mut temp, &mut self.0);
match temp {
List::Cons(t, next) => {
self.0 = *next;
if let List::Cons(_, next) = &self.0 {
unsafe { prefetch_read_data::<List<T>>(&**next, 3) }
}
Some(t)
}
List::Nil => None,
}
}
}
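The prefetch_read_data call is the nightly std::intrinsics::prefetch_read_data intrinsic (behind the core_intrinsics feature); the 3 is the locality hint, the strongest "keep it in cache" setting. The benchmarks also rely on a small into_iter_prefetch constructor that I don't show. Here is a rough sketch of that constructor, plus a stable x86_64 alternative to the nightly intrinsic (my approximation, not the exact code from the repo):
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
impl<T> List<T> {
    // Presumed constructor for the prefetching iterator.
    pub fn into_iter_prefetch(self) -> ListIterPrefetch<T> {
        ListIterPrefetch(self)
    }
}
// Stable equivalent of the prefetch on x86_64: request the node into all
// cache levels (T0) before the iterator actually follows the pointer.
fn prefetch_node<T>(node: &List<T>) {
    unsafe { _mm_prefetch::<_MM_HINT_T0>(node as *const List<T> as *const i8) }
}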
There are also implementations for Generators and Streams, as well as versions that operate over references, but they all look pretty much the same, so I am omitting them for brevity (a rough sketch of the Stream flavor appears further down). The test harness is pretty simple: it just takes name-function pairs and times them:
use std::time::Instant;
type BenchFn<T> = fn(List<T>, List<T>) -> T;
fn bench(funcs: &[(&str, BenchFn<i32>)]) {
for (s, f) in funcs {
let (l, r) = gen_lists();
let now = Instant::now();
println!("bench: {s} result: {} time: {:?}", f(l, r), now.elapsed());
}
}
For example, usage with the basic iterator tests:
bench(&[
("iter::zip", |l, r| {
l.into_iter().zip(r).fold(0, |a, (l, r)| a + l + r)
}),
("iter::zip prefetch", |l, r| {
l.into_iter_prefetch()
.zip(r.into_iter_prefetch())
.fold(0, |a, (l, r)| a + l + r)
}),
("iter::chain", |l, r| l.into_iter().chain(r).sum()),
]);
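The omitted Generator and Stream implementations have the same shape as the iterators above. As a rough sketch of the Stream flavor only (this uses the futures crate's stream::unfold, StreamExt, and block_on; it is a simplification, not the exact code from the repo):
use futures::executor::block_on;
use futures::stream::{self, Stream, StreamExt};
// Turn a List into a Stream by peeling off one node per item.
fn list_stream<T>(list: List<T>) -> impl Stream<Item = T> {
    stream::unfold(list, |l| async move {
        match l {
            List::Cons(t, next) => Some((t, *next)),
            List::Nil => None,
        }
    })
}
// Zip-and-sum the two lists, driven by a trivial single-threaded executor.
fn stream_zip_sum(l: List<i32>, r: List<i32>) -> i32 {
    block_on(
        list_stream(l)
            .zip(list_stream(r))
            .fold(0, |a, (l, r)| async move { a + l + r }),
    )
}
The prefetching Stream variant presumably issues the same prefetch call right before handing back each item.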
The results of the tests on my computer, an Intel(R) Core(TM) i5-8365U CPU with 24 GB of RAM:
Bench owned
bench: iter::zip result: 67108864 time: 11.1873901s
bench: iter::zip prefetch result: 67108864 time: 19.3889487s
bench: iter::chain result: 67108864 time: 8.4363853s
bench: generator zip result: 67108864 time: 16.7242197s
bench: generator chain result: 67108864 time: 8.9897788s
bench: generator prefetch result: 67108864 time: 11.7599589s
bench: stream::zip result: 67108864 time: 14.339864s
bench: stream::chain result: 67108864 time: 7.7592133s
bench: stream::zip prefetch result: 67108864 time: 11.1455706s
Bench ref
bench: iter::zip result: 67108864 time: 1.1343996s
bench: iter::zip prefetch result: 67108864 time: 864.4865ms
bench: iter::chain result: 67108864 time: 1.4036277s
bench: generator zip result: 67108864 time: 1.1360857s
bench: generator chain result: 67108864 time: 1.740029s
bench: generator prefetch result: 67108864 time: 904.1086ms
bench: stream::zip result: 67108864 time: 1.0902568s
bench: stream::chain result: 67108864 time: 1.5683112s
bench: stream::zip prefetch result: 67108864 time: 1.2031745s
The result is the calculated sum and the time is the elapsed wall-clock time. Looking at the destructive (owned) summation, a few things stand out: prefetching actually hurts the plain iterator zip (19.4s vs 11.2s), chain is the fastest variant for every technique, and the generator and stream zips do benefit from prefetching (16.7s down to 11.8s, and 14.3s down to 11.1s respectively). Looking at the borrowing summation, the picture flips: the prefetching iterator and generator are the fastest runs overall, roughly 20-25% faster than their plain zip counterparts, while the stream prefetch variant comes out slightly slower than the plain stream zip.
This testing wasn't the most scientific, for a variety of reasons, and isn't a particularly realistic workload, but given the results I can confidently say that a prefetch-and-switch strategy can result in solid performance improvements if done right. I also omitted a fair bit of the testing code for brevity; the full code can be found here.
Upvotes: 4