Prunus Persica

Reputation: 1203

memcpy beats SIMD intrinsics

I have been looking at fast ways to copy various amounts of data when NEON vector instructions are available on an ARM device.

I've run some benchmarks with interesting results, and I'm trying to understand what I'm looking at.

I have four versions of the copy:

1. Baseline

Copies element by element:

for (int i = 0; i < size; ++i)
{
    copy[i] = orig[i];
}

2. NEON

This code loads four values into a temporary register, then stores the register to the output.

Thus the number of loads is reduced by half. There may be a way to skip the temporary register and reduce the loads by one quarter, but I haven't found one.

int32x4_t tmp;
for (int i = 0; i < size; i += 4)
{
    tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
    vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
}

3. Stepped memcpy

Uses memcpy, but copies 4 elements at a time, to compare against the NEON version.

for (int i = 0; i < size; i+=4)
{
    memcpy(orig+i, copy3+i, 4);
}
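(Side note for readers: memcpy takes the destination first and counts bytes, not elements, so the `4` above moves only a single std::int32_t per iteration. A byte-correct sketch of the stepped copy, assuming 32-bit elements, would be the following; the `stepped_copy` helper name is mine.)

```cpp
#include <cstring>
#include <cstdint>

// memcpy's destination comes first and its third argument counts bytes,
// so moving 4 int32 elements per step takes 4 * sizeof(std::int32_t) bytes.
void stepped_copy(std::int32_t* dst, const std::int32_t* src, int size)
{
    for (int i = 0; i < size; i += 4)
    {
        memcpy(dst + i, src + i, 4 * sizeof(std::int32_t));
    }
}
```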

4. Normal memcpy

Uses memcpy on the full amount of data.

memcpy(orig, copy4, size);

My benchmark using 2^16 values gave some surprising results:

1. Baseline time = 3443[µs]
2. NEON time = 1682[µs]
3. memcpy (stepped) time = 1445[µs]
4. memcpy time = 81[µs]

The speedup for the NEON version is expected; however, the faster stepped memcpy time is surprising to me, and the time for 4 even more so.

Why is memcpy doing so well? Does it use NEON under-the-hood? Or are there efficient memory copy instructions I am not aware of?

This question discussed NEON versus memcpy(). However, I don't feel the answers sufficiently explore why the ARM memcpy implementation runs so well.

The full code listing is below:

#include <arm_neon.h>
#include <vector>
#include <cinttypes>

#include <iostream>
#include <cstdlib>
#include <chrono>
#include <cstring>

int main(int argc, char *argv[]) {

    int arr_size;
    if (argc==1)
    {
        std::cout << "Please enter an array size" << std::endl;
        exit(1);
    }

    int size =  atoi(argv[1]); // not very C++, sorry
    std::int32_t* orig = new std::int32_t[size];
    std::int32_t* copy = new std::int32_t[size];
    std::int32_t* copy2 = new std::int32_t[size];
    std::int32_t* copy3 = new std::int32_t[size];
    std::int32_t* copy4 = new std::int32_t[size];


    // Non-neon version
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int i = 0; i < size; ++i)
    {
        copy[i] = orig[i];
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Baseline time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    // NEON version
    begin = std::chrono::steady_clock::now();
    int32x4_t tmp;
    for (int i = 0; i < size; i += 4)
    {
        tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
        vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
    }
    end = std::chrono::steady_clock::now();
    std::cout << "NEON time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;


    // Stepped memcpy example
    begin = std::chrono::steady_clock::now();
    for (int i = 0; i < size; i+=4)
    {
        memcpy(orig+i, copy3+i, 4);
    }
    end = std::chrono::steady_clock::now();
    std::cout << "memcpy (stepped) time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;


    // Full memcpy example
    begin = std::chrono::steady_clock::now();
    memcpy(orig, copy4, size);
    end = std::chrono::steady_clock::now();
    std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;

    return 0;
}

Upvotes: 0

Views: 2321

Answers (2)

Gregory Bolshakov

Reputation: 11

If anyone is interested in real numbers for the fixed code:

---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_DirectCopy/threads:1        100234 ns       102534 ns         7467
BM_AVX2/threads:1              127413 ns       125558 ns         5600
BM_MemcpyChunked/threads:1     123646 ns       122768 ns         5600
BM_Memcpy/threads:1             92502 ns        87891 ns         6400

Edit: I ran it on Windows, 13th Gen Intel(R) Core(TM) i9-13980HX 2.20 GHz (32 × 2419 MHz CPUs)

CPU Caches:

L1 Data 48 KiB (x16)

L1 Instruction 32 KiB (x16)

L2 Unified 2048 KiB (x16)

L3 Unified 36864 KiB (x1)

MSVC v143, C++14 standard

Upvotes: 1

Pascal Getreuer

Reputation: 3256

Note: this code uses memcpy in the wrong direction. It should be memcpy(dest, src, num_bytes).
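Also, the third argument counts bytes, so copying size 32-bit elements needs size * sizeof(std::int32_t), not size. A corrected sketch of the final test's copy (the `full_copy` helper name is mine):

```cpp
#include <cstring>
#include <cstdint>

// Destination first, size in bytes: copying `size` int32 elements
// moves size * sizeof(std::int32_t) bytes from orig into copy4.
void full_copy(std::int32_t* copy4, const std::int32_t* orig, int size)
{
    memcpy(copy4, orig, size * sizeof(std::int32_t));
}
```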

Because the "normal memcpy" test happens last, the order-of-magnitude speedup over the other tests would be explained by dead-code elimination: the optimizer saw that orig is not used after the last memcpy, so it eliminated the call.

A good way to write reliable benchmarks is with the Google Benchmark framework, using its benchmark::DoNotOptimize(x) function to prevent dead-code elimination.
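As a stdlib-only sketch of the same idea (benchmark::DoNotOptimize comes from Google Benchmark; the volatile-sink helper below is my own stand-in, not part of any library):

```cpp
#include <cstring>
#include <cstdint>

// Stand-in for benchmark::DoNotOptimize: reading the result through a
// volatile pointer makes the copied data observable, so the compiler
// cannot dead-code-eliminate the timed memcpy.
std::int32_t copy_and_observe(std::int32_t* dst, const std::int32_t* src, int size)
{
    memcpy(dst, src, size * sizeof(std::int32_t));
    const volatile std::int32_t* sink = dst; // observed: the copy stays live
    return sink[0] + sink[size - 1];
}
```

With Google Benchmark itself, wrapping the destination buffer in benchmark::DoNotOptimize inside the timed loop achieves the same effect.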

Upvotes: 8
