What is the practical relationship between the STREAM memory bandwidth benchmark and the potential speedup from running MPI locally?

Question

I ran the STREAM memory bandwidth benchmark (https://www.cs.virginia.edu/stream/) on a computer with 10 processors. The benchmark indicated that after 3 or 4 processors, the speedup plateaued at about 3x. What are the practical implications of this result for the performance of an MPI code? For simplicity, assume the program runs multiple processes locally, on this multicore machine only. Does this mean that a memory-access-intensive program cannot get more than a 3x speedup, even if it uses all the cores? If you ran a program that was not memory-access-intensive, could you theoretically get the full 10x? If you simultaneously ran two or three memory-access-intensive programs, each using three processors, would each of them get a 3x speedup, or would they interfere with and slow each other down as they all pulled from RAM at once?

vim_ · Accepted Answer

Speedup depends on how much parallelism exists in the code, but any shared resource can become the bottleneck, depending on the type of application. If your application is memory-intensive, it will be limited by memory bandwidth, so the roughly 3x plateau you measured is a realistic ceiling for it. If it's not memory-intensive and it is highly parallel (take Monte Carlo sampling as an example), you will get close to the full speedup from your cores.
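As a rough way to see how these two extremes connect, here is a minimal sketch of a speedup model. It is my own simplification, not something the benchmark reports: it assumes a fraction of the single-process run time is bandwidth-bound and stops scaling at the measured plateau, that the rest scales perfectly with the number of processes, and that the two parts do not overlap. The names `mixed_speedup`, `mem_fraction`, and `plateau` are made up for illustration.

```python
def mixed_speedup(n_procs, mem_fraction, plateau):
    """Estimate speedup on n_procs processes.

    mem_fraction: assumed share of single-process run time that is
                  limited by memory bandwidth (0.0 to 1.0).
    plateau:      speedup at which the bandwidth benchmark flattens
                  (about 3 in the question).
    """
    # Compute-bound part: assumed to scale perfectly with process count.
    compute_time = (1.0 - mem_fraction) / n_procs
    # Bandwidth-bound part: stops scaling once the plateau is reached.
    memory_time = mem_fraction / min(n_procs, plateau)
    # Single-process run time is normalized to 1.
    return 1.0 / (compute_time + memory_time)


if __name__ == "__main__":
    for f in (0.0, 0.5, 1.0):
        print(f"mem_fraction={f:.1f}: "
              f"estimated speedup on 10 procs = {mixed_speedup(10, f, 3.0):.2f}")
```

Under these assumptions a purely bandwidth-bound code stays at the plateau and a purely compute-bound one reaches the full 10x, with real codes somewhere in between. Caches, communication cost, and overlap of computation with memory traffic are all ignored here.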

To answer your last question (multiple memory-intensive programs): at the end of the day, all reads and writes go through the memory controllers, so the outcome depends on the memory banks and on where the physical pages are allocated. Either of the two situations you mentioned could happen.
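To make the two outcomes concrete, here is a small sketch under strong assumptions: jobs are spread evenly over some number of independent memory domains (for example, separate controllers), each domain saturates at the same plateau, and jobs sharing a domain split its bandwidth equally. Whether your machine has more than one such domain is something I don't know, and `per_job_speedup` and its parameters are hypothetical names.

```python
def per_job_speedup(n_jobs, procs_per_job, plateau_per_domain, n_domains):
    """Estimate the speedup each bandwidth-bound job sees when run together.

    Assumes jobs are spread evenly across n_domains independent memory
    domains and that jobs sharing a domain split its bandwidth equally.
    """
    # Worst-loaded domain: ceiling division of jobs over domains.
    jobs_per_domain = -(-n_jobs // n_domains)
    # Bandwidth the jobs in that domain would like, in units of what
    # one process can draw on its own.
    demand = jobs_per_domain * procs_per_job
    # The domain cannot deliver more than its plateau.
    delivered = min(demand, plateau_per_domain)
    return delivered / jobs_per_domain


if __name__ == "__main__":
    # Three jobs of three processes each, plateau of 3 per domain.
    print("one shared domain:  ", per_job_speedup(3, 3, 3.0, 1))
    print("one domain per job: ", per_job_speedup(3, 3, 3.0, 3))
```

In this model, three jobs contending for a single saturated domain each end up no faster than a single process, while three jobs that each get their own domain all keep their 3x. Real behaviour falls between these depending on page placement.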
