Prunus Persica

Reputation: 1203

memcpy beats SIMD intrinsics

I have been looking at fast ways to copy various amounts of data when NEON vector instructions are available on an ARM device.

I've run some benchmarks with interesting results, and I'm trying to understand what I'm looking at.

I have four versions of the copy:

1. Baseline

Copies element by element:

for (int i = 0; i < size; ++i)
{
    copy[i] = orig[i];
}

2. NEON

This code loads four values into a temporary register, then stores the register to the output.

Thus the number of loads is reduced by half. There may be a way to skip the temporary register and reduce the loads by one quarter, but I haven't found one.

int32x4_t tmp;
for (int i = 0; i < size; i += 4)
{
    tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
    vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
}

3. Stepped memcpy

Uses memcpy, but copies 4 elements at a time, to compare against the NEON version.

for (int i = 0; i < size; i+=4)
{
    memcpy(orig+i, copy3+i, 4);
}
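(Side note for readers: memcpy takes the destination first and counts bytes, not elements, so the `4` above moves only a single std::int32_t per iteration. A byte-correct sketch of the stepped copy, assuming 32-bit elements, would be the following; the `stepped_copy` helper name is mine.)

```cpp
#include <cstring>
#include <cstdint>

// memcpy's destination comes first and its third argument counts bytes,
// so moving 4 int32 elements per step takes 4 * sizeof(std::int32_t) bytes.
void stepped_copy(std::int32_t* dst, const std::int32_t* src, int size)
{
    for (int i = 0; i < size; i += 4)
    {
        memcpy(dst + i, src + i, 4 * sizeof(std::int32_t));
    }
}
```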

4. Normal memcpy

Uses memcpy on the full amount of data.

memcpy(orig, copy4, size);

My benchmark using 2^16 values gave some surprising results:

1. Baseline time = 3443[µs]
2. NEON time = 1682[µs]
3. memcpy (stepped) time = 1445[µs]
4. memcpy time = 81[µs]

The speedup for the NEON version is expected; however, the faster stepped memcpy time is surprising to me, and the time for 4 even more so.

Why is memcpy doing so well? Does it use NEON under-the-hood? Or are there efficient memory copy instructions I am not aware of?

This question discussed NEON versus memcpy(). However, I don't feel the answers sufficiently explore why the ARM memcpy implementation runs so well.

The full code listing is below:

#include <arm_neon.h>
#include <vector>
#include <cinttypes>

#include <iostream>
#include <cstdlib>
#include <chrono>
#include <cstring>

int main(int argc, char *argv[]) {

    int arr_size;
    if (argc==1)
    {
        std::cout << "Please enter an array size" << std::endl;
        exit(1);
    }

    int size =  atoi(argv[1]); // not very C++, sorry
    std::int32_t* orig = new std::int32_t[size];
    std::int32_t* copy = new std::int32_t[size];
    std::int32_t* copy2 = new std::int32_t[size];
    std::int32_t* copy3 = new std::int32_t[size];
    std::int32_t* copy4 = new std::int32_t[size];


    // Non-neon version
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int i = 0; i < size; ++i)
    {
        copy[i] = orig[i];
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Baseline time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    // NEON version
    begin = std::chrono::steady_clock::now();
    int32x4_t tmp;
    for (int i = 0; i < size; i += 4)
    {
        tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
        vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
    }
    end = std::chrono::steady_clock::now();
    std::cout << "NEON time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;


    // Stepped memcpy example
    begin = std::chrono::steady_clock::now();
    for (int i = 0; i < size; i+=4)
    {
        memcpy(orig+i, copy3+i, 4);
    }
    end = std::chrono::steady_clock::now();
    std::cout << "memcpy (stepped) time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;


    // Full memcpy example
    begin = std::chrono::steady_clock::now();
    memcpy(orig, copy4, size);
    end = std::chrono::steady_clock::now();
    std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    return 0;
}

Upvotes: 0

Views: 2321

Answers (2)

Gregory Bolshakov

Reputation: 11

If anyone is interested in real numbers for the fixed code:

---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_DirectCopy/threads:1        100234 ns       102534 ns         7467
BM_AVX2/threads:1              127413 ns       125558 ns         5600
BM_MemcpyChunked/threads:1     123646 ns       122768 ns         5600
BM_Memcpy/threads:1             92502 ns        87891 ns         6400

Edit: I ran it on Windows, 13th Gen Intel(R) Core(TM) i9-13980HX 2.20 GHz (32 × 2419 MHz CPUs)

CPU Caches:

L1 Data 48 KiB (x16)

L1 Instruction 32 KiB (x16)

L2 Unified 2048 KiB (x16)

L3 Unified 36864 KiB (x1)

MSVC v143, C++14 standard

Upvotes: 1

Pascal Getreuer

Reputation: 3256

Note: this code uses memcpy in the wrong direction. It should be memcpy(dest, src, num_bytes).
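Also, the third argument counts bytes, so copying size 32-bit elements needs size * sizeof(std::int32_t), not size. A corrected sketch of the final test's copy (the `full_copy` helper name is mine):

```cpp
#include <cstring>
#include <cstdint>

// Destination first, size in bytes: copying `size` int32 elements
// moves size * sizeof(std::int32_t) bytes from orig into copy4.
void full_copy(std::int32_t* copy4, const std::int32_t* orig, int size)
{
    memcpy(copy4, orig, size * sizeof(std::int32_t));
}
```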

Because the "normal memcpy" test happens last, the order-of-magnitude speedup over the other tests would be explained by dead-code elimination: the optimizer saw that orig is not used after the last memcpy, so it eliminated the call.

A good way to write reliable benchmarks is with the Google Benchmark framework, using its benchmark::DoNotOptimize(x) function to prevent dead-code elimination.
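As a stdlib-only sketch of the same idea (benchmark::DoNotOptimize comes from Google Benchmark; the volatile-sink helper below is my own stand-in, not part of any library):

```cpp
#include <cstring>
#include <cstdint>

// Stand-in for benchmark::DoNotOptimize: reading the result through a
// volatile pointer makes the copied data observable, so the compiler
// cannot dead-code-eliminate the timed memcpy.
std::int32_t copy_and_observe(std::int32_t* dst, const std::int32_t* src, int size)
{
    memcpy(dst, src, size * sizeof(std::int32_t));
    const volatile std::int32_t* sink = dst; // observed: the copy stays live
    return sink[0] + sink[size - 1];
}
```

With Google Benchmark itself, wrapping the destination buffer in benchmark::DoNotOptimize inside the timed loop achieves the same effect.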

Upvotes: 8
