Reputation: 649
I want to know how memory behaves under a totally random read workload. Usually, random access is simulated with pointer chasing. However, that method creates a dependency between consecutive memory accesses (each load's address comes from the previous load), which makes the measured bandwidth limited by load latency instead of by memory channel bandwidth.
How can I test the memory bandwidth limit under a random-access workload?
Upvotes: 0
Views: 482
Reputation: 365517
Use a fast PRNG like xorshift+ to generate random offsets into a big byte array. (Use a power-of-2 array size so you can just mask the PRNG results to modulo them into the array range.)
Anything that defeats HW prefetch is sufficient; it doesn't have to be high-quality randomness. Even an LCG can be good enough, with an implicit modulo 2^32 from using fixed-width 32-bit integers, so each step is just a multiply and an add. (On modern x86, that's 4-cycle critical-path latency. Even with good memory-level parallelism, modern CPUs can't sustain one load per 4 cycles for loads that miss all the way to DRAM, so the PRNG won't be the bottleneck.)
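For example, a minimal sketch of that loop (the function name is made up, and the constants are the well-known Numerical Recipes LCG constants; any full-period 32-bit LCG works):

```c
#include <stdint.h>
#include <stddef.h>

// Sum bytes at pseudo-random offsets into a power-of-2 sized array.
// Returning the sum keeps the compiler from optimizing the loads away.
uint64_t random_reads(const uint8_t *arr, size_t size, size_t iters)
{
    uint32_t state = 12345;        // arbitrary seed
    size_t mask = size - 1;        // size must be a power of 2
    uint64_t sum = 0;
    for (size_t i = 0; i < iters; i++) {
        state = state * 1664525u + 1013904223u;  // LCG step, implicit mod 2^32
        sum += arr[state & mask];                // mask instead of modulo
    }
    return sum;
}
```

Note that unlike pointer chasing, each load address comes from the PRNG state rather than from loaded data, so the loads are independent of each other and out-of-order execution can keep many cache misses in flight at once.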
But for smaller array sizes, where you're getting a lot of L2 hits, you might think about unrolling with two LCGs in parallel, to overlap their dependency chains.
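A sketch of that unrolled variant (the second set of constants here is Borland's `rand()` LCG; any pair of independent full-period LCGs works, and the loop assumes an even iteration count):

```c
#include <stdint.h>
#include <stddef.h>

// Two independent LCG chains per iteration: their multiply/add latencies
// overlap, roughly doubling the rate at which addresses are generated.
uint64_t random_reads_x2(const uint8_t *arr, size_t size, size_t iters)
{
    uint32_t s0 = 12345, s1 = 54321;   // arbitrary distinct seeds
    size_t mask = size - 1;            // size must be a power of 2
    uint64_t sum0 = 0, sum1 = 0;
    for (size_t i = 0; i < iters; i += 2) {
        s0 = s0 * 1664525u  + 1013904223u;  // chain 0
        s1 = s1 * 22695477u + 1u;           // chain 1, different constants
        sum0 += arr[s0 & mask];
        sum1 += arr[s1 & mask];
    }
    return sum0 + sum1;   // separate accumulators also keep the adds off the critical path
}
```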
If you specifically want to benchmark all cache misses, then a PRNG with period = array size (like an LCG using the power-of-2 array size as a modulus) could be better than higher-quality randomness where `random & (size-1)` will sometimes be the same location. Period = size will give you a shuffle of element indices. It won't avoid spatio-temporal locality entirely, though, so you'll still get some hits in some levels of cache, but not many.
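For a power-of-2 modulus, the Hull-Dobell full-period conditions reduce to: increment odd, multiplier ≡ 1 (mod 4). The constants above happen to satisfy that, so masking the state every step gives a full-period LCG over the element indices. A sketch under those assumptions:

```c
#include <stdint.h>
#include <stddef.h>

// Full-period LCG over [0, size): with size a power of 2, an odd increment,
// and a multiplier = 1 (mod 4), the recurrence visits every index exactly
// once per period, i.e. a fixed permutation (shuffle) of the element indices.
uint64_t shuffled_reads(const uint8_t *arr, size_t size)
{
    size_t mask = size - 1;   // size must be a power of 2
    uint32_t idx = 0;
    uint64_t sum = 0;
    for (size_t i = 0; i < size; i++) {
        idx = (idx * 1664525u + 1013904223u) & mask;  // LCG mod size
        sum += arr[idx];      // each byte touched exactly once per pass
    }
    return sum;
}
```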
You can make things harder for the cache by making it an array of 64-byte structs, so you only ever access each cache line once. Or an array of `int` or `int64_t` instead of bytes, to reduce the number of elements per cache line, and thus reduce the effective cache capacity when counted in elements instead of bytes.
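A sketch of the 64-byte-struct variant (assuming 64-byte cache lines; allocate the array with `aligned_alloc(64, ...)` so elements coincide with line boundaries):

```c
#include <stdint.h>
#include <stddef.h>

// One element per cache line, so a random index never hits a line that a
// different element already pulled in.
struct line { uint64_t val; uint64_t pad[7]; };  // 64 bytes total
_Static_assert(sizeof(struct line) == 64, "expected 64-byte elements");

uint64_t random_line_reads(const struct line *arr, size_t nelems, size_t iters)
{
    uint32_t state = 12345;
    size_t mask = nelems - 1;          // nelems must be a power of 2
    uint64_t sum = 0;
    for (size_t i = 0; i < iters; i++) {
        state = state * 1664525u + 1013904223u;
        sum += arr[state & mask].val;  // exactly one access per line touched
    }
    return sum;
}
```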
Upvotes: 0