Alexander

Reputation: 419

Memory bandwidth measurement with memset, memcpy

I am trying to understand the performance of memory operations with memcpy/memset. I measure the time needed for a loop containing memset and memcpy calls; see the attached code (it is C++11, but the picture is the same in plain C). It is understandable that memset is faster than memcpy, but that is more or less the only thing I understand. The biggest question is:

Why is there such a strong dependence on the number of loop iterations?

The application is single-threaded! And the CPU is an AMD FX(tm)-4100 Quad-Core Processor.

And here are some numbers:

The code:

/*
-- Compilation and run:
g++ -O3 -std=c++11 -o mem-speed mem-speed.cc && ./mem-speed

-- Output example:
*/

#include <cstdio>
#include <cstdint>   // uint64_t
#include <cstdlib>   // atoi
#include <chrono>
#include <cstring>   // memset, memcpy
#include <memory>
#include <utility>   // std::pair

using namespace std;

const uint64_t _KB=1024, _MB=_KB*_KB, _GB=_KB*_KB*_KB;


std::pair<double,char> measure_memory_speed(uint64_t buf_size,int n_iters)
{
    // Without returning something read from the buffers, the compiler may optimize the memset() and memcpy() calls away.
    char retval=0;

    unique_ptr<char[]> buf1(new char[buf_size]), buf2(new char[buf_size]);

    auto time_start = chrono::high_resolution_clock::now();
    for( int i=0; i<n_iters; i++ )
    {
        memset(buf1.get(),123,buf_size);
        retval += buf1[0];
    }
    auto t1 = chrono::duration_cast<std::chrono::nanoseconds>(chrono::high_resolution_clock::now() - time_start);

    time_start = chrono::high_resolution_clock::now();
    for( int i=0; i<n_iters; i++ )
    {
        memcpy(buf2.get(),buf1.get(),buf_size);
        retval += buf2[0];
    }
    auto t2 = chrono::duration_cast<std::chrono::nanoseconds>(chrono::high_resolution_clock::now() - time_start);

    printf("memset: iters=%d  %g GB in %8.4g s :  %8.4g GB per second\n",
           n_iters,n_iters*buf_size/double(_GB),(double)t1.count()/1e9, n_iters*buf_size/double(_GB) / (t1.count()/1e9) );
    printf("memcpy: iters=%d  %g GB in %8.4g s :  %8.4g GB per second\n",
           n_iters,n_iters*buf_size/double(_GB),(double)t2.count()/1e9, n_iters*buf_size/double(_GB) / (t2.count()/1e9) );
    printf("\n");

    // Note: divide by double(_GB), not _GB, to avoid truncating integer division.
    double avr = n_iters*buf_size/double(_GB) * (1e9/t1.count()+1e9/t2.count()) / 2;

    retval += buf1[0]+buf2[0];
    return std::pair<double,char>(avr,retval);
}

int main(int argc,const char **argv)
{
    uint64_t n=64;
    if( argc==2 )
      n = atoi(argv[1]);
    for( int i=0; i<=10; i++ )
        measure_memory_speed(n*_MB,1<<i);

    return 0;
}

Upvotes: 0

Views: 3482

Answers (1)

AnthonyLambert

Reputation: 8830

Surely this is just down to the caches warming up: the instruction cache loads on the first iteration, so the code runs faster afterwards, and the data cache speeds up the memset/memcpy accesses on further iterations. Cache memory is inside the processor, so it does not have to fetch or store data in the slower external memory as often, and therefore runs faster.

Upvotes: 1
