Erroneous single thread memory bandwidth benchmark

Question

In an attempt to measure the bandwidth of the main memory, I have come up with the following approach.

Code (for the Intel compiler)

#include <omp.h>

#include <iostream> // std::cout
#include <limits> // std::numeric_limits
#include <cstdlib> // std::free
#include <unistd.h> // sysconf
#include <stdlib.h> // posix_memalign
#include <random> // std::mt19937


int main()
{
    // test-parameters
    const auto size = std::size_t{150 * 1024 * 1024} / sizeof(double);
    const auto experiment_count = std::size_t{500};
    
    //+/////////////////
    // access a data-point 'on a whim'
    //+/////////////////
    
    // warm-up
    for (auto counter = std::size_t{}; counter < experiment_count / 2; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }
        
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        
        // deallocate resources
        free(data);
    }
    
    // timed run
    auto min_duration = std::numeric_limits<double>::max();
    for (auto counter = std::size_t{}; counter < experiment_count; ++counter)
    {
        // garbage data allocation and memory page loading
        double* data = nullptr;
        posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), size * sizeof(double));
        if (data == nullptr)
        {
            std::cerr << "Fatal error! Unable to allocate memory." << std::endl;
            std::abort();
        }
        //#pragma omp parallel for simd safelen(8) schedule(static)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = -1.0;
        }
        
        const auto dur1 = omp_get_wtime() * 1E+6;
        //#pragma omp parallel for simd safelen(8) schedule(static)
        #pragma omp simd safelen(8)
        for (auto index = std::size_t{}; index < size; ++index)
        {
            data[index] = 10.0;
        }
        const auto dur2 = omp_get_wtime() * 1E+6;
        const auto run_duration = dur2 - dur1;
        if (run_duration < min_duration)
        {
            min_duration = run_duration;
        }
        
        // deallocate resources
        free(data);
    }
    
    // REPORT
    const auto traffic = size * sizeof(double) * 2; // 1x load, 1x write
    std::cout << "Using " << omp_get_max_threads() << " threads. Minimum duration: " << min_duration << " us;\n"
        << "Maximum bandwidth: " << traffic / min_duration * 1E-3 << " GB/s;" << std::endl;
    
    return 0;
}

Notes on code

This is assumed to be a 'naive' approach and is Linux-only. It should still serve as a rough indicator of model performance.
Compiled with ICC using the flags -O3 -ffast-math -march=coffeelake.
The size (150 MiB) is much bigger than the last-level cache of the system (9 MiB on the i5-8400 Coffee Lake); main memory is 2x 16 GiB DDR4 DIMMs at 3200 MT/s.
New allocations on each iteration should invalidate all cache lines from the previous one (to eliminate cache hits).
The minimum duration is recorded to counteract the effects of interrupts and OS scheduling, such as threads being taken off cores for a short while.
A warm-up run is done to counteract the effects of dynamic frequency scaling (a kernel feature that can alternatively be turned off by using the userspace governor).
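
The reporting arithmetic at the end of the program can be isolated as a small helper. This is only a sketch: the function name `bandwidth_gbs` and its `traffic_factor` parameter are illustrative and not part of the original code. The factor of 2 used in the benchmark assumes that every store first loads the cache line (write-allocate) and later writes it back. If the compiler were to emit non-temporal stores, which bypass the cache, a factor of 1 would be the appropriate value. Which case applies depends on the generated assembly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not in the original code): converts a byte count and
// a duration in microseconds into GB/s, where 1 GB = 10^9 bytes.
// traffic_factor is the assumed number of memory transfers per byte stored:
// 2 for write-allocate (load + write-back), 1 for non-temporal stores.
inline double bandwidth_gbs(std::size_t bytes_stored, double traffic_factor, double duration_us)
{
    assert(duration_us > 0.0);
    const double traffic_bytes = static_cast<double>(bytes_stored) * traffic_factor;
    // One byte per microsecond equals 10^6 bytes/s, i.e. 10^-3 GB/s.
    return traffic_bytes / duration_us * 1e-3;
}
```

For the 150 MiB buffer (157,286,400 bytes), `bandwidth_gbs(157286400, 2.0, 3500.0)` gives roughly 90 GB/s, while the same duration with a factor of 1 gives roughly 45 GB/s.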

Results of code

On my machine, I am getting 90 GB/s. Intel Advisor, which runs its own benchmarks, has calculated or measured this bandwidth to actually be 25 GB/s. (See my previous question: Intel Advisor's bandwidth information where a previous version of this code was getting page-faults inside the timed region.)
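
As a rough plausibility check, the theoretical peak bandwidth of the memory described above can be estimated from its data rate. The sketch below assumes that the two DIMMs run in dual-channel mode at the full 3200 MT/s with a 64-bit (8-byte) bus per channel; the function name `peak_bandwidth_gbs` is made up for illustration.

```cpp
#include <cassert>

// Hypothetical helper: theoretical peak DRAM bandwidth in GB/s (10^9 bytes/s).
// mega_transfers_per_s: data rate in MT/s (3200 for DDR4-3200)
// channels:             number of populated memory channels (assumed, not measured)
// bus_width_bytes:      bytes moved per transfer per channel (8 for a 64-bit channel)
inline double peak_bandwidth_gbs(double mega_transfers_per_s, int channels, int bus_width_bytes = 8)
{
    assert(mega_transfers_per_s > 0.0 && channels > 0 && bus_width_bytes > 0);
    return mega_transfers_per_s * 1e6 * channels * bus_width_bytes / 1e9;
}
```

Under these assumptions, `peak_bandwidth_gbs(3200.0, 2)` is 3200e6 x 2 x 8 bytes/s = 51.2 GB/s, so a reported 90 GB/s is above what the DRAM could deliver even in the best case, whereas 25 GB/s lies below it.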

Assembly: here's a link to the assembly generated for the above code: https://godbolt.org/z/Ma7PY49bE

I do not understand how I am getting such an unreasonably high bandwidth figure. Any tips that would help my understanding would be greatly appreciated.
