Reputation: 1203
I have been looking at fast ways to copy various amounts of data when NEON vector instructions are available on an ARM device.
I've done some benchmarks, and have some interesting results. I'm trying to understand what I'm looking at.
I have four versions that copy the data:
Baseline: copies element by element.
for (int i = 0; i < size; ++i)
{
copy[i] = orig[i];
}
NEON: loads four values into a temporary SIMD register, then stores the register to the output.
Thus the number of loads is reduced by half. There may be a way to skip the temporary register and reduce the loads by one quarter, but I haven't found it; a sketch of a wider variant is shown after the loop below.
int32x4_t tmp;
for (int i = 0; i < size; i += 4)
{
tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
}
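For reference, a wider variant that processes 16 elements per iteration with four quad registers might look like the sketch below (this assumes size is a multiple of 16; it mainly amortizes loop overhead rather than reducing the per-element memory traffic):
for (int i = 0; i < size; i += 16)
{
    int32x4_t a = vld1q_s32(orig + i);      // load elements i .. i+3
    int32x4_t b = vld1q_s32(orig + i + 4);  // load elements i+4 .. i+7
    int32x4_t c = vld1q_s32(orig + i + 8);  // load elements i+8 .. i+11
    int32x4_t d = vld1q_s32(orig + i + 12); // load elements i+12 .. i+15
    vst1q_s32(copy2 + i,      a);           // store them back out in order
    vst1q_s32(copy2 + i + 4,  b);
    vst1q_s32(copy2 + i + 8,  c);
    vst1q_s32(copy2 + i + 12, d);
}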
memcpy (stepped): uses memcpy, but copies 4 elements at a time. This is to compare against the NEON version.
for (int i = 0; i < size; i+=4)
{
memcpy(orig+i, copy3+i, 4);
}
memcpy: uses memcpy with the full amount of data.
memcpy(orig, copy4, size);
My benchmark using 2^16 values gave some surprising results:
1. Baseline time = 3443[µs]
2. NEON time = 1682[µs]
3. memcpy (stepped) time = 1445[µs]
4. memcpy time = 81[µs]
The speedup for the NEON version is expected; however, the faster stepped memcpy time is surprising to me, and the time for version 4 is even more so.
Why is memcpy doing so well? Does it use NEON under the hood? Or are there efficient memory copy instructions I am not aware of?
This question discussed NEON versus memcpy(). However, I don't feel the answers explore sufficiently why the ARM memcpy implementation runs so well.
The full code listing is below:
#include <arm_neon.h>
#include <vector>
#include <cinttypes>
#include <iostream>
#include <cstdlib>
#include <chrono>
#include <cstring>
int main(int argc, char *argv[]) {
int arr_size;
if (argc==1)
{
std::cout << "Please enter an array size" << std::endl;
exit(1);
}
int size = atoi(argv[1]); // not very C++, sorry
std::int32_t* orig = new std::int32_t[size];
std::int32_t* copy = new std::int32_t[size];
std::int32_t* copy2 = new std::int32_t[size];
std::int32_t* copy3 = new std::int32_t[size];
std::int32_t* copy4 = new std::int32_t[size];
// Non-neon version
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
for (int i = 0; i < size; ++i)
{
copy[i] = orig[i];
}
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "Baseline time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
// NEON version
begin = std::chrono::steady_clock::now();
int32x4_t tmp;
for (int i = 0; i < size; i += 4)
{
tmp = vld1q_s32(orig + i); // load 4 elements to tmp SIMD register
vst1q_s32(&copy2[i], tmp); // copy 4 elements from tmp SIMD register
}
end = std::chrono::steady_clock::now();
std::cout << "NEON time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
// Memcpy example
begin = std::chrono::steady_clock::now();
for (int i = 0; i < size; i+=4)
{
memcpy(orig+i, copy3+i, 4);
}
end = std::chrono::steady_clock::now();
std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
// Memcpy example
begin = std::chrono::steady_clock::now();
memcpy(orig, copy4, size);
end = std::chrono::steady_clock::now();
std::cout << "memcpy time = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
return 0;
}
Upvotes: 0
Views: 2321
Reputation: 11
If anyone is interested in real numbers for the fixed code:
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_DirectCopy/threads:1 100234 ns 102534 ns 7467
BM_AVX2/threads:1 127413 ns 125558 ns 5600
BM_MemcpyChunked/threads:1 123646 ns 122768 ns 5600
BM_Memcpy/threads:1 92502 ns 87891 ns 6400
(Edited): I ran it on Windows, on a 13th Gen Intel(R) Core(TM) i9-13980HX @ 2.20 GHz (32 x 2419 MHz CPUs).
CPU Caches:
L1 Data 48 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 2048 KiB (x16)
L3 Unified 36864 KiB (x1)
MSVC v143, C++14 standard
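For reference, the BM_AVX2 case is the x86 counterpart of the NEON loop; a minimal sketch (my illustration, not the exact benchmark body) using unaligned 256-bit loads/stores and assuming size is a multiple of 8:
#include <immintrin.h>
#include <cstdint>

// Copies 8 int32 elements per iteration through a 256-bit register.
void avx2_copy(const std::int32_t* src, std::int32_t* dst, int size)
{
    for (int i = 0; i < size; i += 8)
    {
        __m256i tmp = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), tmp);
    }
}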
Upvotes: 1
Reputation: 3256
Note: this code uses memcpy in the wrong direction. It should be memcpy(dest, src, num_bytes).
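As an illustration, a corrected chunked loop would look like this (memcpy counts bytes, so 4 elements of std::int32_t are 4 * sizeof(std::int32_t) bytes):
for (int i = 0; i < size; i += 4)
{
    memcpy(copy3 + i, orig + i, 4 * sizeof(std::int32_t)); // destination first, length in bytes
}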
Because the "normal memcpy" test happens last, the massive order-of-magnitude speedup vs. the other tests would be explained by dead code elimination. The optimizer saw that orig is not used after the last memcpy, so it eliminated the memcpy.
A good way to write reliable benchmarks is with the Google Benchmark framework, using its benchmark::DoNotOptimize(x) function to prevent dead code elimination.
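A minimal sketch of such a benchmark (the buffer sizes and the BM_Memcpy name are illustrative, not taken from the question):
#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Measures a full-buffer memcpy; DoNotOptimize/ClobberMemory keep the copy observable
// so the optimizer cannot treat it as dead code.
static void BM_Memcpy(benchmark::State& state)
{
    std::vector<std::int32_t> orig(1 << 16, 1);
    std::vector<std::int32_t> copy(1 << 16);
    for (auto _ : state)
    {
        std::memcpy(copy.data(), orig.data(), orig.size() * sizeof(std::int32_t));
        benchmark::DoNotOptimize(copy.data()); // treat the result as used
        benchmark::ClobberMemory();            // force the stores to complete
    }
}
BENCHMARK(BM_Memcpy);
BENCHMARK_MAIN();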
Upvotes: 8